Machine learning for NLP
Introduction
This is a reading course for PhD students at GU and Chalmers. The course is coordinated by Richard Johansson.
The course starts with an introductory seminar led by RJ. A reading period follows, after which each participant will give a seminar and turn in a paper. (Late spring?)
Examination
To pass, each participant has to turn in a paper and present it at a seminar.
Meetings
Date | Topic | Location | Speaker |
---|---|---|---|
28 January, 13.00 | Introduction (slides, code) | L307 | RJ |
Literature
- Hal Daumé III, A Course in Machine Learning. This book seems to be an ongoing project.
- If you want to go deep into the theory, then we have Hastie, Tibshirani and Friedman: The Elements of Statistical Learning.
- Joakim Nivre's course on machine learning for NLP, including slides and recorded lectures, is available here.
Reading list
Naturally, you will probably focus on a subset of these ;)
Document classifiers
- Most papers are variations of this idea: Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features, tech report, University of Dortmund, 1997. (A minimal sketch of this bag-of-words + SVM setup follows after the list.)
- Here is an example of such a variation: Pang and Lee, Thumbs up? Sentiment Classification using Machine Learning Techniques, EMNLP 2002.
- A recent example: Günther, Sentiment Analysis of Microblogs, Master Thesis in Language Technology, GU, 2013.
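To make the recipe concrete, here is a minimal sketch (not taken from any of the papers above) of a bag-of-words document classifier with a linear SVM in scikit-learn; the tiny training set and its labels are made up.

```python
# A minimal sketch of a Joachims / Pang & Lee-style document classifier:
# bag-of-words (tf-idf) features and a linear SVM, using scikit-learn.
# The tiny training set and the labels are made-up assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_docs = ["a wonderful, moving film",
              "dull and predictable plot",
              "great acting and a clever script",
              "a boring waste of time"]
train_labels = ["pos", "neg", "pos", "neg"]

# the vectorizer turns each document into a sparse feature vector,
# and the SVM learns a linear separator in that space
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_docs, train_labels)

print(clf.predict(["a clever and moving story"]))   # hopefully ['pos']
```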
Classifiers with linguistic features
In this kind of work, we would typically take some classification algorithm off the shelf and then define a set of linguistically well-motivated features. Most of the intellectual effort goes into breaking down the task into smaller decisions so that a classifier can be applied, and then designing the features.
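As an entirely made-up illustration of this recipe, here is a sketch of one such small decision, a coreference-style "are these two mentions coreferent?" classifier: each instance is a dict of hand-designed features given to an off-the-shelf learner. The feature set and toy mentions are my own assumptions, not taken from the papers below.

```python
# One small decision per instance, represented as a dict of linguistically
# motivated features and classified with an off-the-shelf learner.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def pair_features(antecedent, anaphor):
    """Hand-designed features for one candidate mention pair."""
    return {
        "head_match": antecedent["head"].lower() == anaphor["head"].lower(),
        "gender_agree": antecedent["gender"] == anaphor["gender"],
        "distance": anaphor["position"] - antecedent["position"],
        "anaphor_is_pronoun": anaphor["pos"] == "PRP",
    }

# toy training pairs: (feature dict, is-coreferent label)
X = [pair_features({"head": "Obama", "gender": "m", "position": 0, "pos": "NNP"},
                   {"head": "he", "gender": "m", "position": 3, "pos": "PRP"}),
     pair_features({"head": "Obama", "gender": "m", "position": 0, "pos": "NNP"},
                   {"head": "she", "gender": "f", "position": 5, "pos": "PRP"})]
y = [True, False]

# DictVectorizer maps feature dicts to vectors; booleans and numbers become values
clf = make_pipeline(DictVectorizer(), LogisticRegression())
clf.fit(X, y)

test = pair_features({"head": "Merkel", "gender": "f", "position": 0, "pos": "NNP"},
                     {"head": "she", "gender": "f", "position": 2, "pos": "PRP"})
print(clf.predict([test]))
```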
Here is a list of good examples of this genre. When reading, just keep in mind that these papers are sometimes a bit old, and the actual machine learning / parameter estimation content might have aged, so don't spend too much effort on that.
- Soon et al., A Machine Learning Approach to Coreference Resolution of Noun Phrases, Computational Linguistics, 27(4), 2001.
- Gildea and Jurafsky, Automatic Labeling of Semantic Roles, Computational Linguistics, 28(3), 2002. For something less medieval on a similar topic, see Pradhan et al. Towards Robust Semantic Role Labeling, Computational Linguistics, 34(2), 2008.
- Nivre et al., Memory-based Dependency Parsing, CoNLL 2004.
Structured prediction
In most of these cases (in particular when using the perceptron) the actual machine learning content isn't that complicated. What makes these approaches a bit more complex is that the algorithms must make use of the structure of the problem.
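To make the "structure" part concrete, here is a minimal sketch of a structured perceptron in the spirit of the Collins paper below, applied to sequence labeling. The tag set, feature templates, and toy data are made-up assumptions, and there is no averaging or regularization.

```python
# Structured perceptron for sequence labeling: decode with Viterbi,
# then reward gold features and penalize predicted features.
from collections import defaultdict

TAGS = ["D", "N", "V"]  # hypothetical toy tag set

def features(words, i, prev_tag, tag):
    """Local features at position i: word/tag pair and tag bigram."""
    return ["word=%s,tag=%s" % (words[i], tag),
            "prev=%s,tag=%s" % (prev_tag, tag)]

def viterbi(words, weights):
    """Exact search for the highest-scoring tag sequence (the 'structure' part)."""
    n = len(words)
    best = [{} for _ in range(n)]           # best[i][tag] = (score, previous tag)
    for tag in TAGS:
        score = sum(weights.get(f, 0.0) for f in features(words, 0, "<s>", tag))
        best[0][tag] = (score, None)
    for i in range(1, n):
        for tag in TAGS:
            best[i][tag] = max(
                (best[i-1][prev][0]
                 + sum(weights.get(f, 0.0) for f in features(words, i, prev, tag)),
                 prev)
                for prev in TAGS)
    # follow back-pointers from the best final tag
    last = max(TAGS, key=lambda t: best[n-1][t][0])
    tags = [last]
    for i in range(n - 1, 0, -1):
        last = best[i][last][1]
        tags.append(last)
    return list(reversed(tags))

def train(data, epochs=10):
    """Perceptron updates on whole sequences."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for words, gold in data:
            pred = viterbi(words, weights)
            if pred != gold:
                prev_g, prev_p = "<s>", "<s>"
                for i in range(len(words)):
                    for f in features(words, i, prev_g, gold[i]):
                        weights[f] += 1.0
                    for f in features(words, i, prev_p, pred[i]):
                        weights[f] -= 1.0
                    prev_g, prev_p = gold[i], pred[i]
    return weights

# toy usage
data = [(["the", "dog", "barks"], ["D", "N", "V"]),
        (["a", "cat", "sleeps"], ["D", "N", "V"])]
w = train(data)
print(viterbi(["the", "cat", "barks"], w))   # hopefully ['D', 'N', 'V']
```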
- McDonald et al., Online Large-Margin Training of Dependency Parsers, ACL 2005. The technical report gives more details.
- The following paper is the basis for much NLP in the 2000s: Collins, Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms, ACL 2002.
- Here are two classic and influential papers that would be read mostly for historical reasons.
  - Lafferty et al., Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, ICML 2001. For something more modern on CRFs, there are overview articles such as this one, which is also one of the few that include a discussion of how to handle features in structured prediction.
  - Taskar et al., Max-Margin Markov Networks, NIPS 2003. Like in the CRF paper, the estimation part has aged a bit.
- A recent example: Zhang et al. Online Learning for Inexact Hypergraph Search, EMNLP 2013.
- Sometimes the conventional wisdom is wrong, and greedy classifiers may perform better than or nearly as well as structured learners. Here is an example: Ratinov and Roth: Design Challenges and Misconceptions in Named Entity Recognition, CoNLL 2009.
Bayesian learning
The math can be a bit heavy in these papers, so I'd recommend starting with Resnik and Hardisty if you want to implement these models. (A tiny gensim example follows after the list below.)
- The original paper on LDA topic modeling: Blei et al., Latent Dirichlet allocation, JMLR, 2003. Here is another overview paper: Blei and Lafferty, Topic models, in Srivastava and Sahami (eds) Text Mining: Classification, Clustering, and Applications, Chapman & Hall, 2009.
- For an introduction to how to actually program with Bayesian models, see Resnik and Hardisty, Gibbs sampling for the uninitiated, tech report from the University of Maryland, 2010. Here is a similar description of LDA: Heinrich, Parameter estimation for text analysis, tech report, 2008.
- An entertaining introduction to more advanced Bayesian models: Knight, Bayesian Inference with Tears, 2009.
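As promised above, here is a tiny topic-modeling sketch using gensim's LdaModel; the toy documents and the number of topics are made up, and a real corpus would of course need proper preprocessing.

```python
# LDA topic modeling on a toy corpus with gensim.
from gensim import corpora, models

docs = [["football", "goal", "league", "match"],
        ["election", "party", "vote", "parliament"],
        ["goal", "match", "referee", "league"],
        ["vote", "parliament", "party", "election"]]

dictionary = corpora.Dictionary(docs)                # word <-> integer id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]   # bag-of-words vectors

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=20)
for topic in lda.print_topics():
    print(topic)
```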
Semisupervised learning, domain adaptation, and such
- Expectation–Maximization is still at the heart of many unsupervised and semisupervised approaches. One of the countless descriptions of EM, which also covers some preliminaries: Collins' lecture notes.
- One of the simplest approaches to semi-supervised learning is to compute some general meaning representation (e.g. clusters, vectors) on large, unlabeled corpora, and use them as features in a normal supervised learner. Here are a couple of examples.
  - Koo et al., Simple Semi-supervised Dependency Parsing, ACL 2008.
  - Turian et al., Word Representations: A Simple and General Method for Semi-Supervised Learning, ACL 2010.
- Neurally inspired learning methods have seen much attention lately; one of the reasons for the excitement is that they can flexibly integrate unlabeled data and multiple learning tasks. Turian above is one example; here is another: Socher et al. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions, EMNLP 2011.
- Domain adaptation methods can be divided into those that have only unlabeled target-domain data, and those that have a small amount of labeled target-domain data.
  - Example, no labeled target data: Blitzer et al., Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification, ACL 2007.
  - Example, little labeled target data: Daumé III, Frustratingly Easy Domain Adaptation, ACL 2007. (A small sketch of its feature-augmentation trick follows after this list.)
- An alternative to domain adaptation is to add noise during training, in order to make the result more robust: Søgaard and Johannsen, Robust learning in random subspaces: equipping NLP for OOV effects, Coling 2012.
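Here is the promised sketch of Daumé III's feature-augmentation trick: every feature is copied into a shared version and a domain-specific version, so the learner can work out which behaviors transfer across domains. The feature dicts are made up; any feature-dict-based classifier (e.g. scikit-learn with a DictVectorizer) could then be applied to the augmented instances.

```python
# "Frustratingly easy" domain adaptation: duplicate every feature into a
# shared copy and a domain-specific copy.
def augment(features, domain):
    """Map {feature: value} to the augmented space for one instance."""
    out = {}
    for name, value in features.items():
        out["shared:" + name] = value        # version shared by all domains
        out[domain + ":" + name] = value     # version specific to this domain
    return out

# toy usage: the same bag-of-words feature, once in a source-domain instance
# and once in a target-domain instance
print(augment({"word=excellent": 1}, "books"))
print(augment({"word=excellent": 1}, "kitchen"))
```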
Links to software libraries
- scikit-learn is an actively developed Python library for classification and clustering.
- NLTK also has some minimal machine learning support.
- gensim is a Python library for unsupervised learning algorithms such as LDA.
- Weka is probably the most well-known ML library for Java. It includes classification and clustering algorithms and some feature selection, and has an interactive UI.
- Mallet is another Java library. This is more NLP-oriented than Weka and includes CRF and LDA.
- LIBLINEAR is a very fast C and Java library for support vector machines and logistic regression. There is also a (slower) nonlinear counterpart called LIBSVM.
- Léon Bottou's CRF implementation is the fastest I've found.
- Examples of machine learning libraries designed for very large-scale problems include Mahout and GraphLab.