Learning by Reconstruction Produces Uninformative Features For Perception
Randall Balestriero, Yann LeCun
TL;DR
The paper challenges the assumption that representations learned by reconstruction readily support perceptual tasks. Through a linear-algebraic analysis and extensive experiments, it shows that the top-variance subspace favored by reconstruction is often misaligned with perception, while the perception-relevant features lie in a bottom subspace that is learned later in training; the misalignment is exacerbated by complexity, background, and resolution. Guiding learning with denoising (notably masking, as in MAEs) can improve alignment with perception without sacrificing reconstruction, whereas simple additive Gaussian noise generally does not. The authors provide closed-form expressions and an alignment metric to quantify and optimize this relationship. These insights offer practical guidance for designing the noise distribution in reconstruction-based learning and may inform representation learning in other modalities.
Abstract
Input space reconstruction is an attractive representation learning paradigm. Despite the interpretability of reconstruction and generation, we identify a misalignment between learning by reconstruction and learning for perception. We show that the former allocates a model's capacity towards a subspace of the data explaining the observed variance, a subspace with uninformative features for the latter. For example, the supervised TinyImagenet task with images projected onto the top subspace explaining 90% of the pixel variance can be solved with 45% test accuracy. Using the bottom subspace instead, accounting for only 20% of the pixel variance, reaches 55% test accuracy. That the features for perception are learned last explains the need for long training times, e.g., with Masked Autoencoders. Learning by denoising is a popular strategy to alleviate that misalignment. We prove that while some noise strategies such as masking are indeed beneficial, others such as additive Gaussian noise are not. Yet, even in the case of masking, we find that the benefits vary as a function of the mask's shape, ratio, and the considered dataset. While tuning the noise strategy without knowledge of the perception task seems challenging, we provide first clues on how to detect if a noise strategy is never beneficial regardless of the perception task.
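To make the top/bottom subspace projection concrete, the following is a minimal NumPy sketch, not the authors' code. It assumes the subspaces are obtained from a PCA of the flattened, centered images and that the bottom subspace is simply the complement of the top one; the paper's exact split may differ (the 90% and 20% figures above suggest the two subspaces can overlap). The function name `split_by_variance` and the parameter `top_fraction` are hypothetical.

```python
import numpy as np


def split_by_variance(X, top_fraction=0.9):
    """Project the rows of X onto top and bottom PCA subspaces.

    The top subspace is spanned by the fewest leading principal directions
    whose cumulative explained variance reaches `top_fraction`; the bottom
    subspace is spanned by the remaining directions (an assumption of this
    sketch, not necessarily the split used in the paper).
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    # SVD of the centered data: rows of Vt are principal directions, and
    # squared singular values are proportional to the explained variance.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    # Smallest k such that the first k directions reach top_fraction.
    k = int(np.searchsorted(np.cumsum(explained), top_fraction)) + 1
    k = min(k, len(s))
    V_top, V_bot = Vt[:k], Vt[k:]
    # Orthogonal projections, mapped back to input (pixel) space.
    X_top = Xc @ V_top.T @ V_top + mean
    X_bot = Xc @ V_bot.T @ V_bot + mean
    return X_top, X_bot, k


if __name__ == "__main__":
    # Synthetic stand-in for flattened images: 200 samples, 10 "pixels"
    # with strongly unequal variance per coordinate.
    rng = np.random.default_rng(0)
    scales = np.array([10.0, 5.0, 2.0, 1.0, 0.5, 0.5, 0.2, 0.2, 0.1, 0.1])
    X = rng.normal(size=(200, 10)) * scales
    X_top, X_bot, k = split_by_variance(X, top_fraction=0.9)
    print("directions kept in the top subspace:", k)
```

In an experiment of the kind described above, one would train the same classifier once on `X_top` and once on `X_bot` and compare test accuracy; that comparison is not reproduced here.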
