Topic Modeling with Structured Priors for Text-Driven Science

Paul, Michael John

Files

PAUL-DISSERTATION-2015.pdf (2.69 MB)

Date

2015-07-24

Authors

Paul, Michael John

Publisher

Johns Hopkins University

Abstract

Many scientific disciplines are being revolutionized by the explosion of public data on the web and social media, particularly in health and social sciences. For instance, by analyzing social media messages, we can instantly measure public opinion, understand population behaviors, and monitor events such as disease outbreaks and natural disasters. Taking advantage of these data sources requires tools that can make sense of massive amounts of unstructured and unlabeled text. Topic models, statistical models that posit low-dimensional representations of data, can uncover interesting latent structure in large text datasets and are popular tools for automatically identifying prominent themes in text. For example, prominent themes of discussion in social media might include politics and health.
To be useful in scientific analyses, topic models must learn interpretable patterns that accurately correspond to real-world concepts of interest. This thesis will introduce topic models that can encode additional structures such as factorizations, hierarchies, and correlations of topics, and can incorporate supervision and domain knowledge. For example, topics about elections and Congressional legislation are related to each other (as part of a broader topic of “politics”), and certain political topics have partisan associations. These types of relations between topics can be modeled by formulating the Bayesian priors over parameters as functions of underlying “components,” which can be constrained in various ways to induce different structures. This approach is first introduced through a topic model called factorial LDA, which models a factorized structure in which topics are conceptually arranged in multiple dimensions. Factorial LDA can be used to model multiple types of information, for example topic and political ideology. We then introduce a family of structured-prior topic models called SPRITE, which creates a unifying representation that generalizes factorial LDA as well as other existing topic models, and creates a powerful framework for building new models. This thesis will also show how these topic models can be used in various scientific applications, such as extracting medical information from forums, measuring healthcare quality from patient reviews, and monitoring public opinion in social media.

Keywords

machine learning, topic modeling, natural language processing, public health

URI

http://jhir.library.jhu.edu/handle/1774.2/39571

Collections

ETD -- Doctoral Dissertations
