Peeking Behind the Curtains of Residual Learning

Tunhou Zhang, Feng Yan, Hai Li, Yiran Chen

TL;DR

This work explains why deep plain nets struggle to train: non-linearities gradually dissipate the input, whereas residual paths preserve input information. It introduces the Plain Neural Net Hypothesis (PNNH) and the PNNH paradigm, a dual-path internal structure that pairs a weight-sharing coder with a learner to maintain input representations across layers. The authors derive a theoretical bound on input preservation and show that an internal path improves this bound relative to plain nets, enabling deep plain CNNs and Transformers to reach competitive performance. Empirically, PNNH-enabled networks on CIFAR-10/100 and ImageNet-1K achieve on-par accuracy with their residual counterparts while offering up to 2× better parameter efficiency and up to 0.3% higher training throughput, supporting a practical path to deep plain nets.

Abstract

The utilization of residual learning has become widespread in deep and scalable neural nets. However, the fundamental principles that contribute to the success of residual learning remain elusive, thus hindering effective training of plain nets with depth scalability. In this paper, we peek behind the curtains of residual learning by uncovering the "dissipating inputs" phenomenon that leads to convergence failure in plain neural nets: the input is gradually compromised through plain layers due to non-linearities, resulting in challenges of learning feature representations. We theoretically demonstrate how plain neural nets degenerate the input to random noise and emphasize the significance of a residual connection that maintains a better lower bound of surviving neurons as a solution. With our theoretical discoveries, we propose "The Plain Neural Net Hypothesis" (PNNH) that identifies the internal path across non-linear layers as the most critical part in residual learning, and establishes a paradigm to support the training of deep plain neural nets devoid of residual connections. We thoroughly evaluate PNNH-enabled CNN architectures and Transformers on popular vision benchmarks, showing on-par accuracy, up to 0.3% higher training throughput, and 2x better parameter efficiency compared to ResNets and vision Transformers.

Paper Structure

This paper contains 17 sections, 4 theorems, 12 equations, 6 figures, 4 tables, 1 algorithm.

Key Result

Proposition 3.1

(Curse of Non-linearity). The ReLU non-linearity, $\mathrm{ReLU}(x)=\max(0, x)$, is a low-rank operator that induces information loss in neuron responses. We define "dissipating inputs" as losing $1-\epsilon$ of the original information from negative neurons that are zeroed by the ReLU non-linearity.
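
The information loss described in Proposition 3.1 can be illustrated with a small NumPy sketch. It is not the paper's construction: the zero-mean Gaussian pre-activations, the helper names `relu` and `surviving_fraction`, and the weight-free skip path `x + ReLU(x)` are all assumptions made for illustration. The sketch shows that ReLU zeroes every negative neuron, so two inputs that differ only on negative neurons become indistinguishable, while an added identity path keeps them apart.

```python
import numpy as np


def relu(x):
    """Elementwise ReLU(x) = max(0, x)."""
    return np.maximum(0.0, x)


def surviving_fraction(x):
    """Fraction of neurons with a strictly positive response (not zeroed by ReLU)."""
    return float(np.mean(x > 0))


rng = np.random.default_rng(0)

# Assumption: zero-mean Gaussian values stand in for one layer's pre-activations.
pre_activations = rng.standard_normal(10_000)
frac = surviving_fraction(pre_activations)
print(f"fraction of neurons surviving ReLU: {frac:.3f}")

# Two inputs that agree on positive neurons and differ only on negative ones.
a = np.array([1.5, -0.2, 0.7, -3.0])
b = np.array([1.5, -2.0, 0.7, -0.1])

# Plain path: the negative entries are zeroed, so both inputs give the same output.
plain_collapses = np.array_equal(relu(a), relu(b))

# Skip path (illustrative, weight-free): x + ReLU(x) keeps the negative entries.
skip_distinguishes = not np.array_equal(a + relu(a), b + relu(b))

print("plain outputs identical:", plain_collapses)
print("skip outputs differ:", skip_distinguishes)
```

Under the symmetric-input assumption, roughly half of the neurons are zeroed in a single layer. The skip path here is deliberately stripped of weights, so it only shows why an identity route keeps the zeroed coordinates recoverable, not how PNNH achieves this without residual connections.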

Figures (6)

  • Figure 1: Because plain CNNs do not connect features from the source to deeper layers, the input information is gradually lost, as reflected by the dissimilarity score (DISTS), causing "dissipating inputs". ResNet addresses "dissipating inputs" by establishing an explicit residual connection, and our proposed PNNH paradigm derives theoretical foundations to incorporate residuals into plain architectures.
  • Figure 2: An example of applying PNNH to the 3rd stage of ResNet-34. We keep the original block in the "stride=2" block and incorporate the PNNH mechanism into the "stride=1" blocks.
  • Figure 3: CIFAR-10 Accuracy-Parameter trade-off on various ConvNets. Number: Network Depth.
  • Figure 4: Training throughput of plain/residual ConvNets. Missing bar: Out-of-memory on GPU.
  • Figure 5: Overview of (left) plain learning and (right) residual learning. We mark activations in blue and mark weights in red.
  • ...and 1 more figure

Theorems & Definitions (4)

  • Proposition 3.1
  • Theorem 3.2
  • Theorem 3.3
  • Theorem 2.1