The good, the bad and the ugly sides of data augmentation: An implicit spectral regularization perspective
Chi-Heng Lin, Chiraag Kaushik, Eva L. Dyer, Vidya Muthukumar
TL;DR
The paper develops a unified spectral framework for understanding data augmentation (DA) in linear models, covering both underparameterized and overparameterized regimes and both regression and classification. It shows that DA acts as an implicit regularizer through two effects: (i) a data-dependent reshaping of the relative proportions of the data covariance spectrum and (ii) a uniform, ridge-like boost of the entire spectrum that stabilizes training, with the overall effect captured by an augmentation-transformed covariance $\boldsymbol{\Sigma}_{\mathrm{aug}}$. By connecting the augmented empirical risk to ridge regression, the authors derive non-asymptotic bounds on the mean squared error (MSE) and the probability of classification error (POE), governed by the effective ranks $\rho_k^{\mathrm{aug}}$ and $R_k^{\mathrm{aug}}$, and they show how different augmentations (e.g., Gaussian noise, masking, cutout, salt-and-pepper, random rotation) reshape the spectrum to yield good, bad, or ugly generalization outcomes. Through case studies, corollaries, and experiments, the work identifies when DA reduces variance without inflating bias (the good case, especially in classification or underparameterized regression) and when it introduces harmful distribution shift or isotropization (the bad and ugly cases, notably in overparameterized regression). The framework also distinguishes two ways of applying augmentation (precomputed vs. on-the-fly) and proposes a new random-rotation augmentation with strong theoretical and empirical performance. Overall, the results offer a principled blueprint for designing DA strategies in practical learning systems and lay groundwork for extending the analysis to nonlinear and self-supervised settings.
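To make the ridge connection concrete, here is a short derivation sketch for least squares. The notation is assumed for illustration and need not match the paper's: $X \in \mathbb{R}^{n \times p}$ has rows $x_i^\top$, $y \in \mathbb{R}^n$ holds the labels, and $g$ is a random augmentation assumed to be unbiased on average, $\mathbb{E}_g[g(x)] = x$, with covariance $\mathrm{Cov}_g(x)$. Writing $g(x_i) = x_i + \xi_i$ with $\mathbb{E}[\xi_i] = 0$, the cross term vanishes in expectation and the averaged augmented objective splits into a data-fit term and a quadratic penalty:

$$
\mathbb{E}_g \sum_{i=1}^{n} \bigl(y_i - \theta^\top g(x_i)\bigr)^2
= \|y - X\theta\|_2^2 + \theta^\top \Bigl(\sum_{i=1}^{n} \mathrm{Cov}_g(x_i)\Bigr)\theta .
$$

When the matrix below is invertible, the minimizer is therefore

$$
\hat{\theta} = \Bigl(X^\top X + \sum_{i=1}^{n} \mathrm{Cov}_g(x_i)\Bigr)^{-1} X^\top y .
$$

For additive Gaussian noise with $\mathrm{Cov}_g(x) = \sigma^2 I$, the penalty is $n\sigma^2 \|\theta\|_2^2$, which is ordinary ridge regression with $\lambda = n\sigma^2$: every eigenvalue of $X^\top X$ is shifted by the same amount. When $\mathrm{Cov}_g(x)$ depends on $x$, the penalty matrix is data-dependent and can reweight directions unevenly.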
Abstract
Data augmentation (DA) is a powerful workhorse for bolstering performance in modern machine learning. Specific augmentations like translations and scaling in computer vision are traditionally believed to improve generalization by generating new (artificial) data from the same distribution. However, this traditional viewpoint does not explain the success of prevalent augmentations in modern machine learning (e.g., randomized masking, cutout, mixup), which greatly alter the training data distribution. In this work, we develop a new theoretical framework to characterize the impact of a general class of DA on underparameterized and overparameterized linear model generalization. Our framework reveals that DA induces implicit spectral regularization through a combination of two distinct effects: a) manipulating the relative proportion of eigenvalues of the data covariance matrix in a training-data-dependent manner, and b) uniformly boosting the entire spectrum of the data covariance matrix through ridge regression. These effects, when applied to popular augmentations, give rise to a wide variety of phenomena, including discrepancies in generalization between overparameterized and underparameterized regimes and differences between regression and classification tasks. Our framework highlights the nuanced and sometimes surprising impacts of DA on generalization, and serves as a testbed for novel augmentation design.
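The two effects named in the abstract can be checked numerically on a toy least-squares problem. The sketch below is not the authors' code; it assumes augmentations that are unbiased on average, in which case least squares averaged over augmentations should coincide with a quadratically penalized fit whose penalty matrix is the summed augmentation covariance. The function names, the synthetic data, and the values of `sigma` and `beta` are all hypothetical choices made for illustration. Gaussian noise gives the uniform penalty `n * sigma**2 * I`, while rescaled random masking gives the data-dependent penalty `beta / (1 - beta) * diag(X.T @ X)`.

```python
import numpy as np


def averaged_augmented_fit(X, y, augment, n_copies, rng):
    """Least squares over n_copies augmented copies of X, via summed normal equations."""
    p = X.shape[1]
    gram = np.zeros((p, p))
    moment = np.zeros(p)
    for _ in range(n_copies):
        X_aug = augment(X, rng)  # fresh random augmentation of every row
        gram += X_aug.T @ X_aug
        moment += X_aug.T @ y    # labels are left unchanged
    return np.linalg.solve(gram, moment)


def penalized_fit(X, y, penalty):
    """Closed-form minimizer of ||y - X t||^2 + t^T penalty t."""
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)


rng = np.random.default_rng(0)
n, p = 50, 5
# Columns with decaying scales, so the covariance spectrum is far from flat.
X = rng.normal(size=(n, p)) * np.array([3.0, 2.0, 1.0, 0.5, 0.25])
y = X @ np.ones(p) + 0.1 * rng.normal(size=n)

sigma, beta = 0.5, 0.3  # noise std and mask probability (illustrative values)


def gaussian_noise(X, rng):
    """Additive isotropic Gaussian noise; per-sample covariance sigma^2 * I."""
    return X + sigma * rng.normal(size=X.shape)


def unbiased_mask(X, rng):
    """Zero each entry with probability beta, rescaled so the mean equals X."""
    keep = rng.random(X.shape) >= beta
    return X * keep / (1.0 - beta)


gram = X.T @ X
penalty_gauss = n * sigma**2 * np.eye(p)                     # uniform boost
penalty_mask = beta / (1.0 - beta) * np.diag(np.diag(gram))  # data-dependent

theta_ols = penalized_fit(X, y, np.zeros((p, p)))
theta_gauss_closed = penalized_fit(X, y, penalty_gauss)
theta_mask_closed = penalized_fit(X, y, penalty_mask)
theta_gauss_mc = averaged_augmented_fit(X, y, gaussian_noise, 5000, rng)
theta_mask_mc = averaged_augmented_fit(X, y, unbiased_mask, 5000, rng)

err_gauss = np.linalg.norm(theta_gauss_mc - theta_gauss_closed)
err_mask = np.linalg.norm(theta_mask_mc - theta_mask_closed)
print("Gaussian noise: distance to penalized fit", err_gauss)
print("Random masking: distance to penalized fit", err_mask)
print("Eigenvalues of X^T X:          ", np.linalg.eigvalsh(gram))
print("With Gaussian-noise penalty:   ", np.linalg.eigvalsh(gram + penalty_gauss))
print("With masking penalty:          ", np.linalg.eigvalsh(gram + penalty_mask))
```

The Monte Carlo fits only approximate the averaged objective, so the printed distances should be small but not zero. The eigenvalue printout is the part to compare: the Gaussian penalty shifts every eigenvalue by the same constant, whereas the masking penalty adds more to directions where the data already has large variance.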
