The geometry of invariant learning: an information-theoretic analysis of data augmentation and generalization
Abdelali Bouyahia, Frédéric LeBlanc, Mario Marchand
TL;DR
The paper develops an information-theoretic framework to analyze data augmentation and invariance learning, modeling augmentation as a distribution over transformations and introducing an orbit-averaged loss to study generalization. It derives generalization bounds that decompose the gap into distribution shift, orbit-level mutual information, and augmentation-induced variability, all controlled by the group diameter $\Delta_{\mathcal{G}}$. A tighter per-sample bound further refines the analysis by decomposing the mutual information into per-example and per-augmentation terms, enabling targeted augmentation strategies and invariance objectives. Experiments on MNIST/FashionMNIST validate that moderate augmentations reduce information leakage and correlate with improved generalization, while overly strong augmentations increase distribution shift and degrade performance. The work provides a principled geometric view of augmentation and offers practical guidance for balancing fidelity to the data with invariance-inducing regularization.
Abstract
Data augmentation is one of the most widely used techniques to improve generalization in modern machine learning, often justified by its ability to promote invariance to label-irrelevant transformations. However, its theoretical role remains only partially understood. In this work, we propose an information-theoretic framework that systematically accounts for the effect of augmentation on generalization and invariance learning. Our approach builds upon mutual information-based bounds, which relate the generalization gap to the amount of information a learning algorithm retains about its training data. We extend this framework by modeling the augmented distribution as a composition of the original data distribution with a distribution over transformations, which naturally induces an orbit-averaged loss function. Under mild sub-Gaussian assumptions on the loss function and the augmentation process, we derive a new generalization bound that decomposes the expected generalization gap into three interpretable terms: (1) a distributional divergence between the original and augmented data, (2) a stability term measuring the algorithm's dependence on training data, and (3) a sensitivity term capturing the effect of augmentation variability. To connect our bounds to the geometry of the augmentation group, we introduce the notion of group diameter, defined as the maximal perturbation that augmentations can induce in the input space. The group diameter provides a unified control parameter that bounds all three terms and highlights an intrinsic trade-off: small diameters preserve data fidelity but offer limited regularization, while large diameters enhance stability at the cost of increased bias and sensitivity. We validate our theoretical bounds with numerical experiments, demonstrating that they reliably track and predict the behavior of the true generalization gap.
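To make the two central objects of the abstract concrete, the following is a minimal illustrative sketch, not the paper's implementation. It assumes a finite set of 2D rotations as the augmentation family, a uniform distribution over those transformations, a squared loss for a linear predictor, and the Euclidean norm for measuring perturbations. The function names (`orbit_averaged_loss`, `empirical_group_diameter`) are hypothetical, and the diameter is only an empirical estimate over a finite sample rather than the supremum over the input space.

```python
import numpy as np

def rotation(theta):
    """Return the 2x2 rotation matrix for angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def squared_loss(w, x, y):
    """Squared error of the linear predictor w on the example (x, y)."""
    return float((w @ x - y) ** 2)

def orbit_averaged_loss(loss_fn, w, x, y, transforms):
    """Average the loss over transformed copies g(x) of a single input.

    Assumption: the distribution over transformations is uniform
    over the finite list `transforms`.
    """
    return float(np.mean([loss_fn(w, g @ x, y) for g in transforms]))

def empirical_group_diameter(X, transforms):
    """Largest displacement ||g(x) - x|| over sample points and transforms.

    Assumption: Euclidean norm, estimated on the finite sample X only.
    """
    return float(max(np.linalg.norm(g @ x - x) for g in transforms for x in X))

# Hypothetical setup: five small rotations and two sample points.
angles = np.linspace(-0.2, 0.2, 5)
transforms = [rotation(a) for a in angles]
X = np.array([[1.0, 0.0], [0.0, 2.0]])
w = np.array([1.0, 0.0])

avg_loss = orbit_averaged_loss(squared_loss, w, X[0], 1.0, transforms)
diameter = empirical_group_diameter(X, transforms)
print(f"orbit-averaged loss: {avg_loss:.6f}")
print(f"empirical group diameter: {diameter:.6f}")
```

Widening the range of `angles` increases the empirical diameter and, in this toy setup, the orbit-averaged loss as well, which mirrors the fidelity side of the trade-off described above. With only the identity transformation, the orbit-averaged loss reduces to the ordinary loss and the diameter is zero.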
