Taming Big Data By Streaming

Johns Hopkins University
Data streams have emerged as a natural computational model for numerous big data processing applications. In this model, algorithms have access to a limited amount of memory and can make only a single pass (or a few passes) over the data, yet must produce sufficiently accurate answers for some objective functions on the dataset. This model captures various real-world applications and stimulates new scalable tools for solving important problems in the big data era. This dissertation focuses on the following two aspects of the streaming model.

1. Understanding the capability of the streaming model. For a vector aggregation stream, i.e., when the stream is a sequence of updates to an underlying $n$-dimensional vector $v$ (for very large $n$), we establish nearly tight space bounds for streaming algorithms approximating functions of the form $\sum_{i=1}^n g(v_i)$ for nearly all one-variable functions $g$, and $l(v)$ for all symmetric norms $l$. These results provide a deeper understanding of the streaming computation model.

2. Tighter upper bounds. We provide better streaming $k$-median clustering algorithms for dynamic point streams, i.e., streams of insertions and deletions of points in a discrete Euclidean space ($[\Delta]^d$ for sufficiently large $\Delta$ and $d$). Our algorithms use $k \cdot \mathrm{poly}(d \log \Delta)$ space and update time, and with high probability maintain an approximate $k$-median solution to the streaming dataset. All previous algorithms for approximating the $k$-median objective over dynamic data streams required space and update time exponential in $d$.
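To make the vector aggregation model concrete, the following is a minimal illustrative sketch (not an algorithm from this dissertation) of the classic AMS second-moment estimator, which handles the special case $g(x) = x^2$, i.e., estimating $F_2 = \sum_i v_i^2$ from a turnstile stream of $(i, \delta)$ updates using memory independent of the dimension $n$. The class and function names are hypothetical, and the hash-based signs stand in for the 4-wise independent hash family required by the formal analysis.

```python
import random
import statistics

def _sign(seed, i):
    # Pseudo-random +/-1 sign for coordinate i; a stand-in for the
    # 4-wise independent hash family used in the actual AMS analysis.
    return 1 if random.Random(hash((seed, i))).random() < 0.5 else -1

class AMSSketch:
    """Estimate F2 = sum_i v_i^2 (the case g(x) = x^2) of the underlying
    vector from a turnstile stream of (index, delta) updates, using a
    number of counters independent of the dimension n."""

    def __init__(self, reps=120, group=6, seed=0):
        self.group = group
        self.seeds = [(seed, j) for j in range(reps)]
        self.z = [0.0] * reps          # z_j = sum_i s_j(i) * v_i

    def update(self, i, delta):
        # Process one stream update v_i += delta (delta may be negative,
        # so deletions are handled for free).
        for j, s in enumerate(self.seeds):
            self.z[j] += _sign(s, i) * delta

    def estimate(self):
        # Median of means of z_j^2; each z_j^2 has expectation F2.
        sq = [z * z for z in self.z]
        means = [statistics.mean(sq[k:k + self.group])
                 for k in range(0, len(sq), self.group)]
        return statistics.median(means)
```

The sketch stores only the `reps` counters, regardless of how large $n$ is, which is the defining feature of the model: sublinear memory with a single pass over the update stream.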
Theoretical Computer Science, Streaming Algorithms