Emergence of Invariance and Disentanglement in Deep Representations

Alessandro Achille, Stefano Soatto

TL;DR

The paper develops an information-theoretic framework showing that invariance to nuisance factors in deep representations is equivalent to minimal information content in the learned representations. By unifying the Information Bottleneck with weight-centered regularization and PAC-Bayes perspectives, it explains how SGD and noise injection bias networks toward invariant and disentangled representations, and how a phase transition governs overfitting on random labels. It provides computable bounds linking the information in weights to the informativeness and independence (disentanglement) of activations, and validates these predictions with experiments across architectures and datasets. The work illuminates the interplay between loss geometry, generalization, and invariance, offering practical regularization avenues and deeper theoretical insight into deep representation learning.

Abstract

Using established principles from Statistics and Information Theory, we show that invariance to nuisance factors in a deep neural network is equivalent to information minimality of the learned representation, and that stacking layers and injecting noise during training naturally bias the network towards learning invariant representations. We then decompose the cross-entropy loss used during training and highlight the presence of an inherent overfitting term. We propose regularizing the loss by bounding such a term in two equivalent ways: one with a Kullback-Leibler term, which relates to a PAC-Bayes perspective; the other using the information in the weights as a measure of complexity of a learned model, yielding a novel Information Bottleneck for the weights. Finally, we show that invariance and independence of the components of the representation learned by the network are bounded above and below by the information in the weights, and therefore are implicitly optimized during training. The theory enables us to quantify and predict sharp phase transitions between underfitting and overfitting of random labels when using our regularized loss, which we verify in experiments, and sheds light on the relation between the geometry of the loss function, invariance properties of the learned representation, and generalization error.
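
The KL-regularized loss described in the abstract can be sketched as cross-entropy plus $\beta$ times a Kullback-Leibler term between a posterior over the weights and a prior. The sketch below is only an illustration under stated assumptions: the factorized Gaussian posterior, the standard normal prior, the summation of the cross-entropy over samples, and the function names `gaussian_kl`, `cross_entropy`, and `regularized_loss` are all choices made here and are not taken from the paper.

```python
import math

def gaussian_kl(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in nats (assumed posterior and prior)."""
    return 0.5 * sum(s * s + m * m - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

def cross_entropy(probs, labels):
    """Cross-entropy in nats, summed over samples (normalization is an assumption)."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels))

def regularized_loss(probs, labels, mu, sigma, beta):
    """Cross-entropy plus beta times the KL complexity term."""
    return cross_entropy(probs, labels) + beta * gaussian_kl(mu, sigma)

# Toy usage: two samples, three classes, two weights.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
labels = [0, 1]
loss = regularized_loss(probs, labels, mu=[0.5, -0.3], sigma=[0.9, 1.1], beta=1.0)
```

With `beta=0` the loss reduces to the plain cross-entropy; increasing `beta` penalizes posteriors that move far from the prior, which is the sense in which the KL term limits the complexity of the learned weights.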

Paper Structure

This paper contains 35 sections, 19 theorems, 64 equations, 5 figures, 1 table.

Key Result

Proposition 2.1

Given a joint distribution $p(x,y)$, where $y$ is a discrete random variable, we can always find a random variable $n$ independent of $y$ such that $x=f(y,n)$, for some deterministic function $f$.
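
A special case makes the decomposition concrete. The sketch below assumes a scalar, continuous $x$ whose conditional distribution given $y$ is Gaussian (a hypothetical toy model, not the paper's general construction or proof). Taking $n = F_y(x)$, the conditional CDF evaluated at $x$, gives a variable that is Uniform(0, 1) for every value of $y$ and hence independent of $y$, while $x = f(y, n) = F_y^{-1}(n)$ is deterministic.

```python
import random
from statistics import NormalDist

# Hypothetical toy model: y in {0, 1}, x | y ~ Normal(mu_y, sigma_y).
CONDITIONALS = {
    0: NormalDist(mu=-1.0, sigma=0.5),
    1: NormalDist(mu=2.0, sigma=1.5),
}

def nuisance(x, y):
    """n = F_y(x): Uniform(0, 1) for every y, hence independent of y."""
    return CONDITIONALS[y].cdf(x)

def f(y, n):
    """Deterministic reconstruction x = F_y^{-1}(n)."""
    return CONDITIONALS[y].inv_cdf(n)

rng = random.Random(0)
samples = []
for _ in range(1000):
    y = rng.randint(0, 1)
    x = rng.gauss(CONDITIONALS[y].mean, CONDITIONALS[y].stdev)
    samples.append((x, y))

# Largest reconstruction error of x from (y, n) over the sample.
max_error = max(abs(f(y, nuisance(x, y)) - x) for x, y in samples)
```

Here `max_error` should be at the level of floating-point round-off, showing that the pair $(y, n)$ carries everything needed to rebuild $x$.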

Figures (5)

  • Figure 1: (Left) The AlexNet model of Zhang et al. (2016) achieves high accuracy (red) even when trained with random labels on CIFAR-10. Using the IB Lagrangian to limit information in the weights leads to a sharp transition to underfitting (blue) predicted by the theory (dashed line). To overfit, the network needs to memorize the dataset, and the information needed grows linearly with the number of samples. (Right) For real labels, the information sufficient to fit the data without overfitting saturates to a value that depends on the dataset but is largely independent of the number of samples. Test accuracy shows a uniform blue plot for random labels, while for real labels it increases with the number of training samples, and is higher near the critical regularizer value $\beta=1$.
  • Figure 2: (Left) Plot of the training error on CIFAR-10 with random labels as a function of the parameter $\beta$ for different models (see the appendix for details). As expected, all models show a sharp phase transition from complete overfitting to underfitting before the critical value $\beta=1$. (Right) We measure the quantity of information in the weights necessary to overfit as we vary the percentage of corrupted labels under the same settings as Figure 1. To fit increasingly random labels, the network needs to memorize more information in the weights; the increase needed to fit entirely random labels is about the same magnitude as the size of a label (2.30 nats/sample).
  • Figure 3: Plots of the test error obtained by training the All-CNN architecture on CIFAR-10 (no data augmentation). (Left) Test error as we increase the number of weights in the network using weight decay but without any additional explicit regularization. Notice that as the number of weights increases, the generalization error plateaus rather than increasing. (Right) Changing the value of $\beta$, which controls the amount of information in the weights, we obtain the characteristic curve of the bias-variance trade-off. This suggests that the quantity of information in the weights correlates well with generalization.
  • Figure 4: (Left) A few training samples generated by adding nuisance clutter $n$ to the MNIST dataset. (Right) Reducing the information in the weights makes the representation $z$ learned by the digit classifier increasingly invariant to nuisances ($I(n;z)$ decreases), while sufficiency is retained ($I(z; y) = I(x; y)$ is constant). As expected, $I(n;z)$ is smaller than the theoretical bound in \ref{cor:multi-layer-bound} but behaves similarly.
  • Figure 5: For different values of $\beta$, we show the image $\hat{x}$ reconstructed from a representation $z \sim p(z|x)$ of the original image $x$ in the first column. For small $\beta$, $z$ contains more information regarding $x$, thus the reconstructed image $\hat{x}$ is close to $x$, background included. Increasing $\beta$ decreases the information in the weights, thus the representation $z$ becomes more invariant to nuisances: the reconstructed image matches important details in $x$ that are preserved in $z$ (i.e., hair color, sex, expression), but background, hair style, and other nuisances are generated anew.
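
The figure of 2.30 nats/sample quoted in the Figure 2 caption matches the entropy of a uniformly random label over the 10 classes of CIFAR-10, $\ln 10 \approx 2.30$. A minimal sketch of that arithmetic, and of the linear growth with dataset size mentioned for Figure 1, is given below; the function names are hypothetical and the calculation assumes labels drawn uniformly and independently.

```python
import math

def label_entropy_nats(num_classes: int) -> float:
    """Entropy in nats of a label drawn uniformly from num_classes classes."""
    return math.log(num_classes)

def nats_to_memorize(num_samples: int, num_classes: int) -> float:
    """Total label information for num_samples independent uniform labels."""
    return num_samples * label_entropy_nats(num_classes)

per_label = label_entropy_nats(10)  # CIFAR-10 has 10 classes
print(f"{per_label:.2f} nats per sample")
```

Under these assumptions the information needed to memorize purely random labels scales as `num_samples * ln(num_classes)`, which is the linear growth the caption refers to.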

Theorems & Definitions (23)

  • Proposition 2.1: Task-nuisance decomposition, Appendix \ref{lemma:task-nuisance-proof}
  • Proposition 3.1: Invariance and minimality, Appendix \ref{prop:invariance-minimality-proof}
  • Remark 3.2
  • Corollary 3.3: Invariants from the Information Bottleneck
  • Corollary 3.4: Bottlenecks promote invariance
  • Proposition 3.5: Stacking increases invariance
  • Proposition 3.6: Actionable Information
  • Proposition 4.1: Information in the weights, \ref{prop:information-weight-proof}
  • Remark 4.2: On the constant $C$
  • Proposition 4.3: Flat minima have low information, Appendix \ref{prop:flat-minima-proof}
  • ...and 13 more