On Model-Based Semi-Supervised Clustering

Embargo until
Journal Title
Journal ISSN
Volume Title
Johns Hopkins University
We consider an extension of model-based clustering to the semi-supervised case, where some of the data are pre-labeled. We provide a derivation of the Bayesian Information Criterion (BIC) approximation to the Bayes factor in this setting. We then use the BIC to the select number of clusters and the variables useful for clustering. We discuss some considerations for $O(1)$ terms in information criteria when performing model-based clustering. Next, we explore a novel method for the initialization of the EM algorithm for the semi-supervised case using modifications to the k-means++ algorithm to account for the labels. Then, we derive an improved theoretical bound on expected cost and observe improved performance in simulated and real data examples. This analysis provides theoretical justification for a typically linear time semi-supervised clustering algorithm. We show how this algorithms outperforms related semi-supervised k-means-style algorithms on several datasets. Finally, we demonstrate semi-supervised model based clustering with our improved k-means++ initialization on two applications. First, we identify behaviotypes in a fly larva dataset. Next, we nominate interesting vertices in graphs using two types of supervision.
Semi-superivsed clustering, k-means, clustering, k-means++, approximation algorithm, GMM