Novel Estimation Methods for Unsupervised Discovery of Latent Structure in Natural Language Text
MetadataShow full item record
This thesis is about estimating probabilistic models to uncover useful hidden structure in data; specifically, we address the problem of discovering syntactic structure in natural language text. We present three new parameter estimation techniques that generalize the standard approach, maximum likelihood estimation, in different ways. Contrastive estimation maximizes the conditional probability of the observed data given a “neighborhood” of implicit negative examples. Skewed deterministic annealing locally maximizes likelihood using a cautious parameter search strategy that starts with an easier optimization problem than likelihood, and iteratively moves to harder problems, culminating in likelihood. Structural annealing is similar, but starts with a heavy bias toward simple syntactic structures and gradually relaxes the bias. Our estimation methods do not make use of annotated examples. We consider their performance in both an unsupervised model selection setting, where models trained under different initialization and regularization settings are compared by evaluating the training objective on a small set of unseen, unannotated development data, and supervised model selection, where the most accurate model on the development set (now with annotations) is selected. The latter is far superior, but surprisingly few annotated examples are required. The experimentation focuses on a single dependency grammar induction task, in depth. The aim is to give strong support for the usefulness of the new techniques in one scenario. It must be noted, however, that the task (as defined here and in prior work) is somewhat artificial, and improved performance on this particular task is not a direct contribution to the greater field of natural language processing. The real problem the task seeks to simulate—the induction of syntactic structure in natural language text—is certainly of interest to the community, but this thesis does not directly approach the problem of exploiting induced syntax in applications. We also do not attempt any realistic simulation of ii human language learning, as our newspaper text data do not resemble the data encountered by a child during language acquisition. Further, our iterative learning algorithms assume a fixed batch of data that can be repeatedly accessed, not a long stream of data observed over time in tandem with acquisition. (Of course, the cognitive criticisms apply to virtually all existing learning methods in natural language processing, not just the new ones presented here.) Nonetheless, the novel estimation methods presented are, we will argue, better suited to adaptation for real engineering tasks than the maximum likelihood baseline. Our new methods are shown to achieve significant improvements over maximum likelihood estimation and maximum a posteriori estimation, using the EM algorithm, for a state-of-the-art probabilistic model used in dependency grammar induction (Klein and Manning, 2004). The task is to induce dependency trees from part-of-speech tag sequences; we follow standard practice and train and test on sequences of ten tags or fewer. Our results are the best published to date for six languages, with supervised model selection: English (improvement from 41.6% directed attachment accuracy to 66.7%, a 43% relative error rate reduction), German (54.4 ! 71.8%, a 38% error reduction), Bulgarian (45.6% ! 58.3%, a 23% error reduction), Mandarin (50.0% ! 58.0%, a 16% error reduction), Turkish (48.0% ! 62.4%, a 28% error reduction, but only 2% error reduction from a left-branching baseline, which gives 61.8%), and Portuguese (42.5%!71.8%, a 51% error reduction). We also demonstrate the success of contrastive estimation at learning to disambiguate partof- speech tags (from unannotated English text): 78.0% to 88.7% tagging accuracy on a known-dictionary task (a 49% relative error rate reduction), and 66.5% to 78.4% on a more difficult task with less dictionary knowledge (a 35% error rate reduction). The experiments presented in this thesis give one of the most thorough explorations to date of unsupervised parameter estimation for models of discrete structures. Two sides of the problem are considered in depth: the choice of objective function to be optimized during training, and the method of optimizing it. We find that both are important in unsupervised learning. Our best results on most of the six languages involve both improved objectives and improved search. The methods presented in this thesis were originally presented in Smith and Eisner (2004, 2005a,b, 2006). The thesis gives a more thorough exposition, relating the methods to other work, presents more experimental results and error analysis, and directly compares the methods to each other.