An overview of condensation phenomenon in deep learning

Zhi-Qin John Xu, Yaoyu Zhang, Zhangchen Zhou

TL;DR

The paper investigates condensation, a phenomenon in nonlinear neural-network training where neurons within the same layer cluster into groups with similar outputs, with the cluster count typically increasing over time. It synthesizes evidence across simple two-layer nets, CNNs, and Transformer-relevant models, and analyzes the training dynamics, loss landscapes, and the role of dropout in driving condensation. A phase-diagram framework is presented to distinguish linear and nonlinear (condensation) regimes in the infinite-width limit, along with the embedding principle that links wider and narrower networks, and implications for generalization and pruning. The work connects condensation to improved generalization, potential pruning strategies, and enhanced reasoning in Transformer-like architectures, offering a new perspective on designing and training efficient deep networks.

Abstract

In this paper, we provide an overview of condensation, a common phenomenon observed during the nonlinear training of neural networks: neurons in the same layer tend to condense into groups with similar outputs. Empirical observations suggest that the number of condensed clusters of neurons in the same layer typically increases monotonically as training progresses. Small weight initialization and Dropout both facilitate this condensation process. We also examine the underlying mechanisms of condensation from the perspectives of training dynamics and the structure of the loss landscape. The condensation phenomenon offers valuable insights into the generalization abilities of neural networks and correlates with stronger reasoning abilities in transformer-based language models.

Paper Structure

This paper contains 17 sections, 2 theorems, 10 equations, 10 figures.

Key Result

Theorem 1

If $\gamma<1$ or $\gamma'>\gamma-1$, then with a high probability over the choice of $\bm{\theta}^0$, we have

Figures (10)

  • Figure 1: The feature maps $\{(\theta_k,A_k)\}_{k}$ of a two-layer ReLU neural network. The red dots and the gray dots are the features of the active and the static neurons respectively and the blue solid lines are the trajectories of the active neurons during the training. The epochs are described in subcaptions.
  • Figure 2: The feature map of two-layer Tanh neural networks. The red dots are the features of neurons at the terminal stage. The initialization scales are indicated in the subcaptions.
  • Figure 3: Single-layer CNN trained with small initialization (convolutional and fully connected layers initially follow $\mathcal{N}(0,96^{-8})$), shown at its final stage of convergence. The activation function is $\tanh(x)$. Colors represent $D(\bm{u},\bm{v})$ for pairs of convolution kernels, with kernel indices shown on the horizontal and vertical axes. If two neurons fall in the same dark blue block, then $D(\bm{u},\bm{v})\sim 1$, indicating that their input weight directions are the same; in beige blocks, $D(\bm{u},\bm{v})\sim -1$, indicating that the directions are opposite. The training set is MNIST. The output layer uses softmax, the loss function is cross-entropy, and the optimizer is Adam with full-batch training. Convolution kernel size $m=3$, learning rate $=2 \times 10^{-4}$. Training continues until $100\%$ accuracy is achieved on the training set, at which point the test set accuracy is $97.62\%$.
  • Figure 4: Condensation phenomenon in a ResNet-18 model pre-trained on ImageNet. (a) and (b) show weights from the first and the last convolutional layers of ResNet-18 respectively, and (c) and (d) are the corresponding outputs.
  • Figure 5: Phase diagram of two-layer ReLU NNs in the infinite-width limit. The marked examples are studied in the existing literature. Table is from Ref. luo2021phase.
  • ...and 5 more figures
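
The Figure 3 caption reads condensation off a matrix of pairwise values $D(\bm{u},\bm{v})$, where values near $1$ mean two neurons share an input weight direction and values near $-1$ mean the directions are opposite. The sketch below illustrates that kind of diagnostic under the assumption that $D$ is the cosine similarity between flattened input weight vectors; the definition is not given in this section, so treat that choice as a guess. The function names, the `0.95` threshold, and the toy weights are hypothetical and chosen only for illustration.

```python
import numpy as np


def cosine_similarity_matrix(weights: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Pairwise cosine similarity between rows of `weights` (one row per neuron).

    Assumption: this stands in for D(u, v) from the Figure 3 caption.
    """
    norms = np.linalg.norm(weights, axis=1, keepdims=True)
    unit = weights / np.maximum(norms, eps)  # guard against zero-norm rows
    return unit @ unit.T


def count_condensed_directions(weights: np.ndarray, threshold: float = 0.95) -> int:
    """Greedy count of distinct weight orientations.

    A neuron joins an existing group when the absolute cosine similarity to
    that group's first member is at least `threshold`, so same and opposite
    directions are counted as one orientation. The threshold is an
    illustrative choice, not a value from the paper.
    """
    sim = cosine_similarity_matrix(weights)
    representatives: list[int] = []
    for i in range(sim.shape[0]):
        if not any(abs(sim[i, r]) >= threshold for r in representatives):
            representatives.append(i)
    return len(representatives)


# Toy example: five neurons whose input weights lie along two directions,
# with different magnitudes and signs (hypothetical data).
directions = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])
scales = np.array([0.5, -2.0, 3.0, 1.5, -0.7])
assignment = np.array([0, 0, 0, 1, 1])
toy_weights = scales[:, None] * directions[assignment]

D = cosine_similarity_matrix(toy_weights)
n_groups = count_condensed_directions(toy_weights)
```

For a convolutional layer, each kernel would first be flattened into one row of `weights`. In a condensed layer, reordering the rows by group would make `D` show the block structure described in the caption.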

Theorems & Definitions (3)

  • Theorem 1: Informal statement luo2021phase
  • Theorem 2: Informal statement luo2021phase
  • Definition 1: multiplicity $p$ zhou2022towards