Show simple item record

dc.contributor.advisor: Eisner, Jason
dc.creator: Andrews, Nicholas Oliver
dc.date.accessioned: 2019-04-15T03:49:55Z
dc.date.created: 2016-05
dc.date.issued: 2016-02-22
dc.date.submitted: May 2016
dc.identifier.uri: http://jhir.library.jhu.edu/handle/1774.2/60650
dc.description.abstract: Learning from unlabeled data is a long-standing challenge in machine learning. A principled solution involves modeling the full joint distribution over inputs and the latent structure of interest, and imputing the missing data via marginalization. Unfortunately, such marginalization is expensive for most non-trivial problems, which places practical limits on the expressiveness of generative models. As a result, joint models often encode strict assumptions about the underlying process, such as fixed-order Markovian assumptions, and employ simple count-based features of the inputs. In contrast, conditional models, which do not directly model the observed data, are free to incorporate rich overlapping features of the input in order to predict the latent structure of interest. It would be desirable to develop expressive generative models that retain tractable inference. This is the topic of this thesis. In particular, we explore joint models which relax fixed-order Markov assumptions, and investigate the use of recurrent neural networks for automatic feature induction in the generative process. We focus on two structured prediction problems: (1) imputing labeled segmentations of input character sequences, and (2) imputing directed spanning trees relating strings in text corpora. These problems arise in many applications of practical interest, but we are primarily concerned with named-entity recognition and cross-document coreference resolution in this work. For named-entity recognition, we propose a generative model in which the observed characters originate from a latent non-Markov process over words, and where the characters are themselves produced via a non-Markov process: a recurrent neural network (RNN). We develop a sampler for this model in which sequential Monte Carlo is used as a transition kernel for a Gibbs sampler. The kernel is amenable to a fast parallel implementation and results in fast mixing in practice. For cross-document coreference resolution, we move beyond sequence modeling to consider string-to-string transduction. We stipulate a generative process for a corpus of documents in which entity names arise from copying---and optionally transforming---previous names of the same entity. Our proposed model is sensitive to both the context in which the names occur and their spelling. The string-to-string transformations correspond to systematic linguistic processes such as abbreviation, typos, and nicknaming, and by analogy to biology, we think of them as mutations along the edges of a phylogeny. We propose a novel block Gibbs sampler for this problem that alternates between sampling an ordering of the mentions and a spanning tree relating all mentions in the corpus.
dc.format.mimetype: application/pdf
dc.language.iso: en_US
dc.publisher: Johns Hopkins University
dc.subject: natural language processing
dc.subject: machine learning
dc.subject: neural network
dc.subject: mcmc
dc.title: Generative Non-Markov Models for Information Extraction
dc.type: Thesis
thesis.degree.discipline: Computer Science
thesis.degree.grantor: Johns Hopkins University
thesis.degree.grantor: Whiting School of Engineering
thesis.degree.level: Doctoral
thesis.degree.name: Ph.D.
dc.date.updated: 2019-04-15T03:49:55Z
dc.type.material: text
thesis.degree.department: Computer Science
local.embargo.lift: 2020-05-01
local.embargo.terms: 2020-05-01
dc.contributor.committeeMember: Dredze, Mark
dc.contributor.committeeMember: Van Durme, Benjamin
dc.publisher.country: USA
dc.creator.orcid: 0000-0002-6097-9164
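
The abstract above describes using sequential Monte Carlo as a transition kernel inside a Gibbs sampler, i.e. a sampler in the particle Gibbs (conditional SMC) family. The sketch below is a minimal, generic illustration of that kernel structure, not the thesis's parallel implementation for the segmentation model; the function and argument names (csmc_kernel, init_fn, trans_fn, logw_fn) are hypothetical, and the toy AR(1) usage at the bottom is invented purely for demonstration.

```python
import numpy as np

def csmc_kernel(y, x_ref, init_fn, trans_fn, logw_fn, n_particles=50, rng=None):
    """One conditional-SMC (particle Gibbs) transition.

    Given observations y and a retained reference trajectory x_ref, run SMC
    while pinning particle 0 to x_ref, then sample a new trajectory from the
    final weighted particle set.  Arguments (hypothetical interface):
      init_fn(n)       -> array of n initial states
      trans_fn(xp, t)  -> proposed next states, one per particle
      logw_fn(xt, y_t) -> per-particle log incremental weights at time t
    """
    rng = rng or np.random.default_rng()
    T = len(y)
    X = np.empty((T, n_particles))            # particle states
    anc = np.zeros((T, n_particles), dtype=int)  # ancestor indices

    X[0] = init_fn(n_particles)
    X[0, 0] = x_ref[0]                        # particle 0 follows the reference
    logw = logw_fn(X[0], y[0])

    for t in range(1, T):
        w = np.exp(logw - logw.max())
        w /= w.sum()
        anc[t] = rng.choice(n_particles, size=n_particles, p=w)
        anc[t, 0] = 0                         # keep the reference ancestry intact
        X[t] = trans_fn(X[t - 1, anc[t]], t)
        X[t, 0] = x_ref[t]
        logw = logw_fn(X[t], y[t])

    # Sample one trajectory by tracing ancestors back from a final particle.
    w = np.exp(logw - logw.max())
    w /= w.sum()
    k = rng.choice(n_particles, p=w)
    traj = np.empty(T)
    for t in reversed(range(T)):
        traj[t] = X[t, k]
        if t > 0:
            k = anc[t, k]
    return traj

# Toy usage (invented example): latent AR(1) state with Gaussian emissions.
# Repeated calls form a Gibbs-style chain over the latent trajectory.
rng = np.random.default_rng(0)
y = rng.normal(size=20)
x = np.zeros(20)                              # initial reference trajectory
for sweep in range(10):
    x = csmc_kernel(
        y, x,
        init_fn=lambda n: rng.normal(size=n),
        trans_fn=lambda xp, t: 0.9 * xp + rng.normal(scale=0.5, size=xp.shape),
        logw_fn=lambda xt, yt: -0.5 * (yt - xt) ** 2,
        rng=rng,
    )
```

In the thesis's setting the latent states would be labeled segmentations of character sequences under an RNN-based generative process rather than the scalar states used here; the sketch only shows the kernel's resampling, reference-pinning, and backward-tracing structure.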

