Information-Theoretic Foundations for Machine Learning
Hong Jun Jeon, Benjamin Van Roy
TL;DR
The paper advances an information-theoretic, Bayesian framework for understanding learning by linking predictive performance to mutual information and rate-distortion bounds. It develops general machinery to bound the irreducible and reducible components of estimation error across iid, sequential, meta-learning, and misspecified settings, with concrete results for linear, logistic, deep, and nonparametric models. By modeling data streams with latent parameters and using rate-distortion theory, it provides explicit bounds that scale with problem dimensions, data size, and model complexity, offering a principled way to anticipate sample efficiency and compute-coverage tradeoffs. The framework illuminates how information flows from data to predictions, informs bounds for minimax and meta-learning scenarios, and connects learning performance to optimal lossy compression, thereby guiding future theory and algorithm design in diverse ML contexts.
Abstract
The progress of machine learning over the past decade is undeniable. In retrospect, it is both remarkable and unsettling that this progress was achievable with little to no rigorous theory to guide experimentation. Nevertheless, practitioners have been able to guide their future experimentation via observations from previous large-scale empirical investigations. In this work, we propose a theoretical framework which attempts to provide rigor to existing practices in machine learning. To the theorist, we provide a framework which is mathematically rigorous and leaves open many interesting ideas for future exploration. To the practitioner, we provide a framework whose results are simple and provide intuition to guide future investigations across a wide range of learning paradigms. Concretely, we provide a theoretical framework rooted in Bayesian statistics and Shannon's information theory which is general enough to unify the analysis of many phenomena in machine learning. Our framework characterizes the performance of an optimal Bayesian learner as it learns from a stream of experience. Unlike existing analyses that weaken with increasing data complexity, our theoretical tools provide accurate insights across diverse machine learning settings. Throughout this work, we derive theoretical results and demonstrate their generality by applying them to derive insights specific to each setting. These settings range from learning from data which is independently and identically distributed under an unknown distribution, to data which is sequential, to data which exhibits hierarchical structure amenable to meta-learning, and finally to data which is not fully explainable under the learner's beliefs (misspecification). These results are particularly relevant as we strive to understand and overcome increasingly difficult machine learning challenges in this endlessly complex world.
