The geometry of invariant learning: an information-theoretic analysis of data augmentation and generalization

Abdelali Bouyahia, Frédéric LeBlanc, Mario Marchand

TL;DR

The paper develops an information-theoretic framework to analyze data augmentation and invariance learning, modeling augmentation as a distribution over transformations and introducing an orbit-averaged loss to study generalization. It derives generalization bounds that decompose the gap into distribution shift, orbit-level mutual information, and augmentation-induced variability, all controlled by the group diameter $\Delta_{\mathcal{G}}$. A tighter per-sample bound further refines the analysis by decomposing MI into per-example and per-augmentation terms, enabling targeted augmentation strategies and invariance objectives. Experiments on MNIST/FashionMNIST validate that moderate augmentations reduce information leakage and correlate with improved generalization, while overly strong augmentations increase distribution shift and degrade performance. The work provides a principled geometric view of augmentation and offers practical guidance for balancing fidelity to the data with invariance-inducing regularization.

Abstract

Data augmentation is one of the most widely used techniques to improve generalization in modern machine learning, often justified by its ability to promote invariance to label-irrelevant transformations. However, its theoretical role remains only partially understood. In this work, we propose an information-theoretic framework that systematically accounts for the effect of augmentation on generalization and invariance learning. Our approach builds upon mutual information-based bounds, which relate the generalization gap to the amount of information a learning algorithm retains about its training data. We extend this framework by modeling the augmented distribution as a composition of the original data distribution with a distribution over transformations, which naturally induces an orbit-averaged loss function. Under mild sub-Gaussian assumptions on the loss function and the augmentation process, we derive a new generalization bound that decomposes the expected generalization gap into three interpretable terms: (1) a distributional divergence between the original and augmented data, (2) a stability term measuring the algorithm's dependence on training data, and (3) a sensitivity term capturing the effect of augmentation variability. To connect our bounds to the geometry of the augmentation group, we introduce the notion of group diameter, defined as the maximal perturbation that augmentations can induce in the input space. The group diameter provides a unified control parameter that bounds all three terms and highlights an intrinsic trade-off: small diameters preserve data fidelity but offer limited regularization, while large diameters enhance stability at the cost of increased bias and sensitivity. We validate our theoretical bounds with numerical experiments, demonstrating that they reliably track and predict the behavior of the true generalization gap.
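
To make the two central objects of the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy setting (planar rotations acting on 2-D points, Euclidean distance, a finite grid of transformations standing in for the augmentation distribution) and estimates the group diameter as the largest displacement $\|g \cdot x - x\|$ over sampled points and transformations. It also averages a loss over the sampled orbit. All function names (`rotation`, `empirical_group_diameter`, `orbit_averaged_loss`) and both toy losses are hypothetical.

```python
import math


def rotation(angle_rad):
    """Return a function that rotates a 2-D point about the origin."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return lambda p: (c * p[0] - s * p[1], s * p[0] + c * p[1])


def empirical_group_diameter(points, transforms):
    """Largest Euclidean displacement ||g(x) - x|| over the samples.

    This is only a finite-sample lower estimate of a supremum; the exact
    definition and norm used in the paper are assumed here.
    """
    return max(math.dist(g(x), x) for g in transforms for x in points)


def orbit_averaged_loss(loss, w, z, transforms):
    """Average loss(w, g(z)) over a finite set of sampled transformations."""
    return sum(loss(w, g(z)) for g in transforms) / len(transforms)


# Toy augmentation family: rotations on an even grid in [-30, +30] degrees.
max_angle = math.radians(30)
transforms = [rotation(max_angle * k / 10) for k in range(-10, 11)]
points = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.8)]  # all on the unit circle

diameter = empirical_group_diameter(points, transforms)


def radial_loss(w, z):
    """Rotation-invariant toy loss: depends on z only through its norm."""
    return (math.hypot(z[0], z[1]) - w) ** 2


def first_coord_loss(w, z):
    """Toy loss that is not rotation-invariant."""
    return (z[0] - w) ** 2


z = (0.6, 0.8)
invariant_avg = orbit_averaged_loss(radial_loss, 0.5, z, transforms)
sensitive_avg = orbit_averaged_loss(first_coord_loss, 0.5, z, transforms)
print(f"diameter estimate: {diameter:.4f}")
print(f"invariant loss, orbit average vs. original: {invariant_avg:.4f} vs. {radial_loss(0.5, z):.4f}")
print(f"sensitive loss, orbit average vs. original: {sensitive_avg:.4f} vs. {first_coord_loss(0.5, z):.4f}")
```

In this toy setting, widening the angle range increases the diameter estimate, and a loss that is already invariant to the transformations is left unchanged by orbit averaging, whereas a non-invariant loss generally is not.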

Paper Structure (32 sections, 21 theorems, 109 equations, 2 figures)

This paper contains 32 sections, 21 theorems, 109 equations, 2 figures.

Key Result

Theorem 3.1

Suppose $\ell(w,Z)$ is $R$-sub-Gaussian under $Z \sim \mathcal{D}$ for every $w \in \mathcal{W}$, then for any learning algorithm characterized by $P_{W|S}$ such that $S \sim \mathcal{D}^m$, we have
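
The list of theorems below attributes Theorem 3.1 to xu2017information. Assuming the statement follows the standard mutual-information bound of Xu and Raginsky (2017), the inequality would take the following form. The symbols $L_{\mathcal{D}}(W)$ (population risk) and $L_S(W)$ (empirical risk on $S$) are notation assumed here for illustration and may differ from the paper's own.

```latex
\left| \mathbb{E}\!\left[ L_{\mathcal{D}}(W) - L_S(W) \right] \right|
\;\le\; \sqrt{\frac{2R^2}{m}\, I(W; S)}
```

Under this reading, the expected generalization gap shrinks as the training set size $m$ grows or as the output $W$ carries less information $I(W;S)$ about the sample.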

Figures (2)

  • Figure 1: Information-theoretic generalization bound components under varying augmentation conditions. Top row: Evolution of the distribution shift (KL-divergence), orbit-averaged mutual information, augmentation mutual information, and total bound (scaled by $R$) as a function of augmentation variance $t^2$. Increasing augmentation strength amplifies distribution shift and total bound while reducing orbit-averaged mutual information. Bottom row: Dependence of the same four components on the number of augmentations per sample $n$, for different training set sizes $m$. The augmentation mutual information term decays rapidly with $n$, leading to a tighter overall bound for larger $n$ and $m$.
  • Figure 2: Empirical evaluation of the generalization bound in Theorem \ref{th:bu_like_bound_mi} on MNIST and FashionMNIST with affine augmentations. Specifically, the transformation transforms.RandomAffine(degrees = strength × 10, translate = (strength, strength)) randomly rotates each image within ±(10 × strength) degrees and translates it by up to a fraction strength of the image width and height. (a, c) Estimated bound versus empirical generalization gap as a function of augmentation level. The bound increases consistently with transformation strength and closely follows the observed generalization behavior. (b, d) Contribution of the individual terms (KL-divergence, orbit-averaged mutual information, and per-example augmentation mutual information), showing that the KL-divergence term dominates while the mutual information components remain relatively stable across augmentation levels.
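
The augmentation described in the Figure 2 caption can be sketched as follows. This is an illustrative sketch rather than the authors' experimental code: the helper names `random_affine_ranges` and `build_augmentation` are hypothetical, and the default 28×28 image size is an assumption based on the standard MNIST and FashionMNIST format. The first helper only spells out the rotation and translation ranges implied by a given strength. The second builds the transform with torchvision's `transforms.RandomAffine`, imported lazily so that the range computation runs without torchvision installed.

```python
def random_affine_ranges(strength, width=28, height=28):
    """Ranges implied by degrees = strength * 10, translate = (strength, strength).

    width/height default to 28 pixels (assumed MNIST-style images).
    """
    max_degrees = 10 * strength
    return {
        # Rotation angle is drawn from [-max_degrees, +max_degrees].
        "rotation_deg": (-max_degrees, max_degrees),
        # Largest horizontal / vertical shift in pixels (either direction).
        "max_shift_px": (strength * width, strength * height),
    }


def build_augmentation(strength):
    """Build the affine augmentation from the caption (requires torchvision).

    torchvision expects the translate fractions to lie in [0, 1],
    so strength is assumed to stay within that interval.
    """
    from torchvision import transforms  # lazy import: optional dependency

    return transforms.RandomAffine(
        degrees=strength * 10,
        translate=(strength, strength),
    )


if __name__ == "__main__":
    for strength in (0.1, 0.5, 1.0):
        print(strength, random_affine_ranges(strength))
```

Read this way, a single scalar strength scales rotation and translation together, so that increasing it enlarges the maximal perturbation an augmentation can apply to an image.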

Theorems & Definitions (36)

  • Definition 3.1
  • Definition 3.2
  • Theorem 3.1: xu2017information
  • Theorem 3.2: bu2020tightening
  • Definition 4.1
  • Definition 4.2
  • Theorem 5.1
  • Theorem 5.2
  • Corollary 5.1
  • Corollary 5.2
  • ...and 26 more