Matching and Inference for Multiple Correlated Data Sets

Johns Hopkins University
Given multiple correlated data sets, an important question is how to use them jointly to improve subsequent statistical inference. This setting is increasingly common as more related data sets are collected, such as images and their descriptions, articles in multiple languages, and actors in multiple social networks. Real data are also often multivariate or high-dimensional, so dimension reduction is necessary before any inference. In this dissertation, I consider three dimension reduction and matching methods: principal component analysis followed by Procrustes matching, canonical correlation analysis, and nonlinear matching using shortest-path distance and joint neighborhood. I investigate their theoretical properties and assess their impact on subsequent inference using the Procrustes fitting error, classification error, and hypothesis testing, respectively. The main conclusion is that, for a given inference task on multiple correlated data sets, joint matching and projection can significantly improve inference performance compared with projecting each data set separately or omitting modalities. Numerical experiments on simulated and real data illustrate the theorems and the methodology.
Dimension reduction, Machine learning, Data matching, Statistical inference
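
As a rough illustration of the first method named in the abstract, the sketch below projects two correlated data sets with principal component analysis and then aligns the embeddings with an orthogonal Procrustes fit. It is only a minimal sketch under assumed conditions, not the dissertation's implementation: the function names `pca_embed` and `procrustes_fit`, the synthetic data, and the choice of two retained dimensions are all hypothetical.

```python
import numpy as np


def pca_embed(data, dim):
    """Project the rows of `data` onto its top `dim` principal components."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dim].T


def procrustes_fit(source, target):
    """Return the orthogonal matrix R minimizing ||source @ R - target||_F
    together with the resulting fitting error."""
    u, _, vt = np.linalg.svd(source.T @ target)
    rotation = u @ vt
    error = np.linalg.norm(source @ rotation - target)
    return rotation, error


# Synthetic, assumed data: two 10-dimensional views of the same 200 objects.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
view_a = latent @ rng.normal(size=(2, 10))
# The second view is an orthogonal transformation of the first plus noise.
q, _ = np.linalg.qr(rng.normal(size=(10, 10)))
view_b = view_a @ q + 0.01 * rng.normal(size=(200, 10))

emb_a = pca_embed(view_a, dim=2)
emb_b = pca_embed(view_b, dim=2)

unmatched_error = np.linalg.norm(emb_a - emb_b)
rotation, matched_error = procrustes_fit(emb_a, emb_b)
print("fitting error before matching:", unmatched_error)
print("fitting error after matching: ", matched_error)
```

Because the identity matrix is itself orthogonal, the fitting error after matching can never exceed the error before matching. How small it becomes depends on how closely the two embeddings are related by a rotation or reflection.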