STREAMING CORESETS FOR HIGH-DIMENSIONAL GEOMETRY
Johns Hopkins University
This thesis studies clustering problems on data streams, specifically with applications to metric spaces that are either non-Euclidean or of high-dimensionality. The algorithms are introduced to approximate minima of the k-median function, although they can be used to find approximate minima of many other functions are shown in Part I. In Part I, we introduce the first analysis of the effect of stream order on the space required to solve k-median. We use this to forge a connection to random-order streams and provide an optimal O(nk)-time algorithm for k-median in the RAM model. In Part II, we provide a nearly optimal algorithm to maintain a coreset for k-median. Specifically for the case of k-means, we introduce the first coreset to be simultaneously independent of both the dimension and the input size. This gives the first sublinear size coreset for high- dimensional sparse data. In Part III, we apply our techniques to streams where points may be deleted as well as inserted, and construct the first coreset for high-dimensional spaces.