Dissimilarity Learning under Noise: ClassicalMultidimensional Scaling and Randomer Forests
MetadataShow full item record
This dissertation mainly concerns with learning from dissimilarities. That is, instead of features, we only observe dissimilarities between pairs of objects for inference. This dissertation is composed with three parts regarding using classical multidimensional scaling under the noise, using Randomer Forest to denoise the datasets and how we can utilize dissimilarity for the task of graph matching. Classical multidimensional scaling (CMDS) is a widely used method in dimensionality reduction and manifold learning. The method takes in a dissimilarity matrix and outputs a low-dimensional configuration matrix based on a spectral decomposition. In this dissertation, we present three noise models and analyze the resulting configuration matrices, or embeddings. In particular, we show that under each of the three noise models the resulting embedding gives rise to a central limit theorem. We also provide compelling simulations and real data illustrations of these central limit theorems. This perturbation analysis represents a significant advancement over previous results regarding classical multidimensional scaling behavior under randomness. CMDS is a special case of a more general procedure called Manifold Learning, which is essentially required to achieve quality inferences for modern high-dimensional datasets. Many manifold learning methods have been proposed, each with their own advantages and disadvantages. In this dissertation, building off recent advances in supervised learning, we modify the leading supervised decision forest method to support unsupervised learning, and therefore also nonlinear manifold learning. The key differentiator between our Unsupervised Randomer Forest (URerf) and other manifold learning techniques is that URerF operates on low-dimensional sparse linear combinations of features, rather than either the full observed dimensionality, or one-dimensional marginals. We quantify the efficacy of URerF by computing precision-recall curves relative to the true latent manifold or class label (when it is known). Empirical results on simulated data demonstrate that URerF is robust to high-dimensional noise, where as other methods, such as Isomap and UMAP, quickly deteriorate in such settings.