Step by Step Network

Dongchen Han, Tianzhu Ye, Zhuofan Xia, Kaiyi Chen, Yulin Wang, Hanting Chen, Gao Huang

TL;DR

Step by Step Network (StepsNet) is a generalized residual architecture that aims to bridge the gap between the theoretical potential and the practical performance of deep models. The authors position it as a superior generalization of the widely adopted residual architecture.

Abstract

Scaling up network depth is a fundamental pursuit in neural architecture design, as theory suggests that deeper models offer exponentially greater capability. Benefiting from the residual connections, modern neural networks can scale up to more than one hundred layers and enjoy wide success. However, as networks continue to deepen, current architectures often struggle to realize their theoretical capacity improvements, calling for more advanced designs to further unleash the potential of deeper networks. In this paper, we identify two key barriers that obstruct residual models from scaling deeper: shortcut degradation and limited width. Shortcut degradation hinders deep-layer learning, while the inherent depth-width trade-off imposes limited width. To mitigate these issues, we propose a generalized residual architecture dubbed Step by Step Network (StepsNet) to bridge the gap between theoretical potential and practical performance of deep models. Specifically, we separate features along the channel dimension and let the model learn progressively via stacking blocks with increasing width. The resulting method mitigates the two identified problems and serves as a versatile macro design applicable to various models. Extensive experiments show that our method consistently outperforms residual models across diverse tasks, including image classification, object detection, semantic segmentation, and language modeling. These results position StepsNet as a superior generalization of the widely adopted residual architecture.
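The channel-splitting scheme described in the abstract can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the helper names `make_residual_net` and `steps_forward`, the toy `tanh` residual block, the widths `[4, 8, 16]`, and in particular the merge rule (concatenating the running output with the next untouched channel slice) are all hypothetical choices made for the example.

```python
import math
import random

def make_residual_net(width, depth, rng):
    """Hypothetical toy network: `depth` residual blocks on `width` channels."""
    weights = [[[rng.gauss(0.0, 1.0 / math.sqrt(width)) for _ in range(width)]
                for _ in range(width)] for _ in range(depth)]

    def forward(z):
        assert len(z) == width
        for w in weights:
            # Residual block: z <- z + tanh(W z); the shortcut carries z forward.
            update = [math.tanh(sum(wij * zj for wij, zj in zip(row, z)))
                      for row in w]
            z = [zi + ui for zi, ui in zip(z, update)]
        return z

    return forward

def steps_forward(x, widths, nets):
    """Run an n-step network on one token's channel vector `x`.

    widths: strictly increasing cumulative widths [C_1, ..., C_n], C_n == len(x).
    nets:   one network per step; nets[i] operates on widths[i] channels.
    """
    assert len(widths) == len(nets) and widths[-1] == len(x)
    assert all(a < b for a, b in zip(widths, widths[1:]))
    # Step 1: only the first channel slice is processed.
    y = nets[0](x[:widths[0]])
    for prev, cur, net in zip(widths, widths[1:], nets[1:]):
        # Assumed merge rule: concatenate the running output with the next
        # untouched slice of the input, then process the wider vector.
        y = net(y + x[prev:cur])
    return y

rng = random.Random(0)
widths = [4, 8, 16]  # blocks get wider at every step
nets = [make_residual_net(w, depth=2, rng=rng) for w in widths]
x = [rng.gauss(0.0, 1.0) for _ in range(widths[-1])]
y = steps_forward(x, widths, nets)
print(len(y))
```

In this sketch the early steps run on narrow vectors, so their blocks are cheap, and the full width is only reached in the last step. That is one way to read "stacking blocks with increasing width"; the paper's actual block design and split sizes may differ.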

Paper Structure

This paper contains 15 sections, 6 equations, 6 figures, 14 tables.

Figures (6)

  • Figure 1: Training accuracy (left) and test accuracy (right) on ImageNet-1K as we gradually increase model depth while keeping the width fixed. The residual architecture DeiT does not deliver satisfactory results as depth increases. In contrast, our method enables the model to leverage increased depth more effectively, achieving much higher results on both training and test sets. Note that the DeiT models used in this pilot study are not the standard DeiT-T/S/B. Please refer to the section labeled sec:exp_deeper for experiment details.
  • Figure 2: The shortcut ratio $\gamma_l=\frac{\sigma_0}{\sigma_l}$ in DeiT and Steps-DeiT, where $\sigma_0$ and $\sigma_l$ are the standard deviations of the input $z_0$ and of the feature $z_l$ after $l$ blocks, respectively. The depth is normalized to $[0, 1]$, where 0 and 1 denote input and output. In a very deep residual model (more than 400 layers), the shortcut ratio $\gamma_l$ approaches zero at early training stages, which prevents later residual blocks from obtaining input information and propagating their gradients back to the input, thus leading to optimization difficulties.
  • Figure 3: An illustration of a Transformer block to help understand the analyses in the section labeled sec:challenges.
  • Figure 4: An illustration of the proposed Step by Step Network. For simplicity, the shortcut in each residual block is omitted. (a) Width-depth trade-off. When enlarging the depth of a residual model, the width has to be reduced to maintain similar computation. In contrast, StepsNet makes it possible for the model to be deeper with fixed width and computation. (b) The 2-step network. Given an input $x\in \mathbb{R}^{N\times C}$, the model first splits the information into two parts $x_1, x_2$ along the channel dimension $C$. Subsequently, $x_1$ and $x_2$ are processed sequentially by two networks $\mathcal{F}_1$ and $\mathcal{F}_2$, generating the final output $y_2$. (c) The $n$-step network. Repeatedly substituting the first network with a 2-step architecture creates an $n$-step network, which divides $x$ into $n$ parts and processes them progressively.
  • Figure 5: Comparison with advanced methods on ImageNet-1K.
  • ...and 1 more figures
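
The shortcut ratio $\gamma_l=\frac{\sigma_0}{\sigma_l}$ from the Figure 2 caption can be illustrated with a toy residual stream. The sketch below is not the paper's measurement on DeiT: it assumes, purely for illustration, that every block adds independent unit-variance noise to the stream, and the function name `shortcut_ratios` and its parameters are hypothetical. Under that assumption the variance grows roughly linearly with depth, so $\gamma_l$ shrinks roughly like $1/\sqrt{1+l}$.

```python
import random
import statistics

def shortcut_ratios(depth, dim=2048, seed=0):
    """Return [gamma_1, ..., gamma_depth] for a toy residual stream.

    gamma_l = sigma_0 / sigma_l, where sigma_0 is the standard deviation of
    the input z_0 and sigma_l that of the feature z_l after l blocks.
    """
    rng = random.Random(seed)
    z0 = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    sigma0 = statistics.pstdev(z0)
    z = list(z0)
    ratios = []
    for _ in range(depth):
        # Assumed stand-in for a residual block: z_{l+1} = z_l + f(z_l),
        # with f(z_l) modeled as independent unit-variance Gaussian noise.
        z = [v + rng.gauss(0.0, 1.0) for v in z]
        ratios.append(sigma0 / statistics.pstdev(z))
    return ratios

ratios = shortcut_ratios(depth=64)
for l in (1, 4, 16, 64):
    print(f"l={l:3d}  gamma_l={ratios[l - 1]:.3f}")
```

Even in this simplified setting the share of the original input in the stream falls steadily as blocks accumulate, which matches the qualitative behavior the caption describes for very deep residual models.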