A Study Of The Mathematics Of Deep Learning
Johns Hopkins University
"Deep Learning"/"Deep Neural Nets" is a technological marvel that is now increasingly deployed at the cutting-edge of artificial intelligence tasks. This ongoing revolution can be said to have been ignited by the iconic 2012 paper from the University of Toronto titled ``ImageNet Classification with Deep Convolutional Neural Networks'' by Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton. This paper showed that deep nets can be used to classify images into meaningful categories with almost human-like accuracies! As of 2020 this approach continues to produce unprecedented performance for an ever widening variety of novel purposes ranging from playing chess to self-driving cars to experimental astrophysics and high-energy physics. But this new found astonishing success of deep neural nets in the last few years has been hinged on an enormous amount of heuristics and it has turned out to be extremely challenging to be mathematically rigorously explainable. In this thesis we take several steps towards building strong theoretical foundations for these new paradigms of deep-learning. Our proofs here can be broadly grouped into three categories, 1. Understanding Neural Function Spaces We show new circuit complexity theorems for deep neural functions over real and Boolean inputs and prove classification theorems about these function spaces which in turn lead to exact algorithms for empirical risk minimization for depth 2 ReLU nets. We also motivate a measure of complexity of neural functions and leverage techniques from polytope geometry to constructively establish the existence of high-complexity neural functions. 2. Understanding Deep Learning Algorithms We give fast iterative stochastic algorithms which can learn near optimal approximations of the true parameters of a $\relu$ gate in the realizable setting. (There are improved versions of this result available in our papers https://arxiv.org/abs/2005.01699 and https://arxiv.org/abs/2005.04211 which are not included in the thesis.) We also establish the first ever (a) mathematical control on the behaviour of noisy gradient descent on a ReLU gate and (b) proofs of convergence of stochastic and deterministic versions of the widely used adaptive gradient deep-learning algorithms, RMSProp and ADAM. This study also includes a first-of-its-kind detailed empirical study of the hyper-parameter values and neural net architectures when these modern algorithms have a significant advantage over classical acceleration based methods. 3. Understanding The Risk Of (Stochastic) Neural Nets We push forward the emergent technology of PAC-Bayesian bounds for the risk of stochastic neural nets to get bounds which are not only empirically smaller than contemporary theories but also demonstrate smaller rates of growth w.r.t increase in width and depth of the net in experimental tests. These critically depend on our novel theorems proving noise resilience of nets. This work also includes an experimental investigation of the geometric properties of the path in weight space that is traced out by the net during the training. This leads us to uncover certain seemingly uniform and surprising geometric properties of this process which can potentially be leveraged into better bounds in future.
Deep Learning Neural Networks Stochastic Optimization Function Approximation Sparse Coding PAC-Bayes