Learning by Reconstruction Produces Uninformative Features For Perception

Randall Balestriero, Yann LeCun

TL;DR

The paper interrogates the assumption that reconstruction-based representations readily support perceptual tasks. Through a linear-algebraic analysis and extensive experiments, it shows that the subspace learned for reconstruction (top-variance) often misaligns with perception, while the perception-relevant features reside in a bottom subspace learned later during training, a phenomenon exacerbated by complexity, background, and resolution. It demonstrates that guiding learning with denoising (notably masking in MAEs) can improve alignment with perception without sacrificing reconstruction, while simple additive Gaussian noise generally does not; the authors provide closed-form expressions and an alignment metric to quantify and optimize this relationship. These insights offer practical guidance for noise-distribution design in reconstruction-based learning and suggest broader implications for improving representation learning across modalities.

Abstract

Input space reconstruction is an attractive representation learning paradigm. Despite interpretability of the reconstruction and generation, we identify a misalignment between learning by reconstruction, and learning for perception. We show that the former allocates a model's capacity towards a subspace of the data explaining the observed variance--a subspace with uninformative features for the latter. For example, the supervised TinyImagenet task with images projected onto the top subspace explaining 90% of the pixel variance can be solved with 45% test accuracy. Using the bottom subspace instead, accounting for only 20% of the pixel variance, reaches 55% test accuracy. The features for perception being learned last explains the need for long training time, e.g., with Masked Autoencoders. Learning by denoising is a popular strategy to alleviate that misalignment. We prove that while some noise strategies such as masking are indeed beneficial, others such as additive Gaussian noise are not. Yet, even in the case of masking, we find that the benefits vary as a function of the mask's shape, ratio, and the considered dataset. While tuning the noise strategy without knowledge of the perception task seems challenging, we provide first clues on how to detect if a noise strategy is never beneficial regardless of the perception task.
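
To make the top/bottom subspace idea concrete, here is a minimal NumPy sketch of projecting flattened images onto PCA subspaces. It is an illustration under assumptions, not the authors' pipeline: the function name `split_by_explained_variance` and the `top_fraction` parameter are hypothetical, and the bottom subspace is taken here as the complement of the top one within the span of the data. The percentages quoted in the abstract (90% and 20%) do not sum to 100%, so the paper's two subspaces need not be complementary as they are in this sketch.

```python
import numpy as np


def split_by_explained_variance(X, top_fraction=0.9):
    """Project rows of X onto top and bottom PCA subspaces (illustrative).

    X: (n_samples, n_features) array of flattened images.
    top_fraction: share of total variance the top subspace must explain.
    Returns (X_top, X_bottom, k), where k is the top-subspace dimension.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are principal directions, sorted by decreasing singular value.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    # Smallest k whose leading directions explain at least top_fraction.
    k = int(np.searchsorted(explained, top_fraction) + 1)
    k = min(k, Vt.shape[0])
    V_top, V_bottom = Vt[:k], Vt[k:]
    # Assumption: the bottom subspace is the complement of the top one.
    X_top = Xc @ V_top.T @ V_top + mean
    X_bottom = Xc @ V_bottom.T @ V_bottom + mean
    return X_top, X_bottom, k
```

A classifier trained once on `X_top` and once on `X_bottom` would then give the kind of comparison the abstract describes; the classifier itself is left out of this sketch.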

Paper Structure (19 sections, 7 theorems, 42 equations, 9 figures)


Key Result

Theorem 1

The loss function from eq:bilinear is minimized for where ${\bm{H}} \triangleq {\bm{D}}^{-\frac{1}{2}}_{{\bm{X}}{\bm{X}}^\top}{\bm{P}}^\top_{{\bm{X}}{\bm{X}}^\top}{\bm{A}}{\bm{P}}_{{\bm{X}}{\bm{X}}^\top} {\bm{D}}^{-\frac{1}{2}}_{{\bm{X}}{\bm{X}}^\top}$. (Proof in proof:linear_solution, empirical validation in fig:validation_general.)
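
The definition of ${\bm{H}}$ above can be sketched numerically. This is a hedged illustration only: it assumes that ${\bm{P}}_{{\bm{X}}{\bm{X}}^\top}$ and ${\bm{D}}_{{\bm{X}}{\bm{X}}^\top}$ denote the eigenvectors and eigenvalues of ${\bm{X}}{\bm{X}}^\top$, that ${\bm{X}}$ stores one sample per column, and that ${\bm{A}}$ is a given square matrix whose definition lives in eq:bilinear and is not reproduced here. The function name `whitened_matrix` and the `eps` threshold are hypothetical.

```python
import numpy as np


def whitened_matrix(X, A, eps=1e-12):
    """Compute H = D^{-1/2} P^T A P D^{-1/2}, assuming X X^T = P D P^T.

    X: (n_features, n_samples) data matrix, one sample per column (assumed).
    A: (n_features, n_features) matrix, treated as given (assumed).
    """
    G = X @ X.T
    # eigh returns ascending eigenvalues and orthonormal eigenvectors (columns).
    eigvals, P = np.linalg.eigh(G)
    if eigvals.min() <= eps:
        raise ValueError("X X^T must be positive definite for D^{-1/2} to exist")
    d_inv_sqrt = 1.0 / np.sqrt(eigvals)
    # Scaling rows and columns equals multiplying by diag(D^{-1/2}) on both sides.
    return d_inv_sqrt[:, None] * (P.T @ A @ P) * d_inv_sqrt[None, :]
```

As a sanity check on the sketch, choosing `A = X @ X.T` should return the identity matrix, since the eigenvalue scaling on both sides then cancels exactly.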

Figures (9)

  • Figure 1: Features for reconstruction are uninformative for perception (top): TinyImagenet ResNet9 top-1 accuracy when trained and validated on images projected onto the top subspace (red) or bottom subspace (blue) of explained variance; corresponding images are displayed in the middle and in \ref{fig:images_pca}. Perception features are learned last (bottom): training loss evolution (red to blue) of reconstructed training images from a deep Autoencoder projected onto the eigenspace of the original data (black). The top eigenspace (right) is learned first, and then, if training lasts long enough, the features most useful for perception (left) are finally learned. This explains why performance on perception tasks keeps increasing long after reconstructed samples look appealing.
  • Figure 2: Depiction of the closed-form alignment measure from \ref{eq:alignment}, which measures the minimum supervised training error achievable given the optimal reconstruction parameters, as per \ref{thm:linear_solution,thm:alignment}. Top: depiction in terms of the latent dimension $K$ (x-axis). Bottom: depiction in terms of the ratio of the latent dimension $K$ to the input dimension $D$. We clearly observe that as the dataset becomes more realistic (going from background-free images to CIFAR and then to TinyImagenet), the alignment between the reconstruction and supervised tasks lessens. In particular, for TinyImagenet, the alignment only increases linearly with respect to the latent space dimension.
  • Figure 3: Reprise of \ref{fig:teaser} for additional autoencoder architectures: convolutional encoder and deconvolutional decoder (top) and MLP encoder and decoder (bottom). We clearly observe that the top subspace is learned first during training; it is the one that best minimizes the reconstruction loss but contains the least informative features for perception, as per \ref{fig:classification}.
  • Figure 4: We depict the classification accuracy of a ResNet9 DNN trained and tested on images projected onto the top (red) and bottom (blue) subspaces, as ordered by the eigenvalues of the data covariance matrix, without data augmentation (top) and with data augmentation (bottom). We clearly observe that, except for datasets without background, for which reconstruction and classification are better aligned (recall \ref{fig:alignment}), the final performance is greater when employing the subspace of the data that explains the least pixel variance, i.e., the bottom subspace.
  • Figure 5: Depiction of multiple ResNet34 autoencoders with varying embedding dimensions (light to dark), some trained only to reconstruct the input samples with data augmentations (blue) and others with an additional supervised loss signal (as per \ref{eq:nonlinear}) (green). We report the test-set accuracy and the relative difference (y-axis) for each of the "paired" models, i.e., models whose training settings are identical except for the use of the supervised signal, as a function of the train and test reconstruction loss. We clearly observe that for any embedding dimension and reconstruction loss, one can find two sets of parameters with drastically different abilities to solve perception tasks. Reconstructed samples and training curves are provided in \ref{fig:compare}.
  • ...and 4 more figures
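
The diagnostic described in the Figure 1 and Figure 3 captions, reconstruction error projected onto the eigenspace of the original data, can be sketched as follows. This is a minimal illustration under assumptions rather than the authors' code: the function name `error_per_eigendirection` is hypothetical, and `X_hat` stands for the output of any autoencoder at a given training step.

```python
import numpy as np


def error_per_eigendirection(X, X_hat):
    """Mean squared reconstruction error along each principal direction of X.

    X, X_hat: (n_samples, n_features) originals and reconstructions.
    Returns an array ordered from the top (largest-variance) direction down.
    """
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions of the original data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Coordinates of the reconstruction error in the data eigenbasis.
    residual = (X - X_hat) @ Vt.T
    return np.mean(residual**2, axis=0)
```

Calling this at several checkpoints and plotting the returned arrays would show which directions are fitted first. A reconstruction that has only captured the top directions yields near-zero error on the leading entries and large error on the trailing ones.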

Theorems & Definitions (11)

  • Theorem 1
  • Corollary 1.1
  • Proposition 1
  • Corollary 1.2
  • Theorem 2
  • Theorem 3
  • Corollary 3.1
  • proof
  • proof
  • proof
  • ...and 1 more