ANCRe: Adaptive Neural Connection Reassignment for Efficient Depth Scaling

Yilang Zhang; Bingcong Li; Niao He; Georgios B. Giannakis

TL;DR

This paper investigates how the topology of residual connections governs optimization and depth efficiency in deep networks. It introduces ANCRe, a lightweight framework that parameterizes all potential shortcuts with coefficients $c_{ij}$ and uses softmax routing with temperature $\tau$ to learn a data-driven topology, adding $K(K-1)/2$ parameters and incurring less than $1\%$ overhead. The authors prove that, in deep linear networks, different shortcut layouts can yield an exponential gap in convergence rates, and show that ANCRe can attain linear convergence by learning an effective topology. They validate ANCRe on pre-training of large language models, diffusion models, and ResNets, reporting faster convergence, improved perplexity and FID, and enhanced depth efficiency at minimal computational cost.
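To illustrate the routing idea described above, here is a minimal NumPy sketch. It is not the authors' implementation: the layout (layer $k \ge 1$ mixes the $k$ earlier states with softmax weights) is an assumption chosen only because it reproduces the $K(K-1)/2$ parameter count, and the names `softmax`, `num_routing_params`, `init_logits`, and `forward` are hypothetical.

```python
import numpy as np

def softmax(logits, tau=1.0):
    # Temperature-scaled softmax; smaller tau gives a sharper (more one-hot) routing.
    z = (np.asarray(logits, dtype=float) - np.max(logits)) / tau
    e = np.exp(z)
    return e / e.sum()

def num_routing_params(K):
    # Parameter count quoted in the summary for a K-layer network.
    return K * (K - 1) // 2

def init_logits(K):
    # ASSUMED layout: layer k (k >= 1) owns k logits, one per earlier state.
    # Zero logits give uniform routing weights at initialization.
    return [np.zeros(k) for k in range(1, K)]

def forward(x, layers, logits, tau=1.0):
    # states[j] is the network input (j = 0) or the output of layer j - 1.
    states = [x]
    for k, layer in enumerate(layers):
        out = layer(states[-1])
        if k >= 1:
            # Softmax-weighted shortcut from the k earlier states (hypothetical rule).
            w = softmax(logits[k - 1], tau)
            out = out + sum(w_j * s_j for w_j, s_j in zip(w, states[:k]))
        states.append(out)
    return states[-1]
```

In this sketch, lowering `tau` pushes each layer toward selecting a single source state, while a large `tau` keeps the mixture close to uniform. The total number of logits grows quadratically in depth but is independent of the hidden width, which is consistent with the small overhead reported in the summary.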

Abstract

Scaling network depth has been a central driver behind the success of modern foundation models, yet recent investigations suggest that deep layers are often underutilized. This paper revisits the default mechanism for deepening neural networks, namely residual connections, from an optimization perspective. Rigorous analysis proves that the layout of residual connections can fundamentally shape convergence behavior, and even induces an exponential gap in convergence rates. Prompted by this insight, we introduce adaptive neural connection reassignment (ANCRe), a principled and lightweight framework that parameterizes and learns residual connectivities from the data. ANCRe adaptively reassigns residual connections with negligible computational and memory overhead ($<1\%$), while enabling more effective utilization of network depth. Extensive numerical tests across pre-training of large language models, diffusion models, and deep ResNets demonstrate consistently accelerated convergence, boosted performance, and enhanced depth efficiency over conventional residual connections.

Paper Structure (24 sections, 16 theorems, 54 equations, 9 figures, 8 tables)

This paper contains 24 sections, 16 theorems, 54 equations, 9 figures, 8 tables.

Introduction
Related work
Residual topology matters: a case study
Linear neural networks with residual connections
Convergence analysis of exponential discrepancies
Learning residual connections from data
ANCRe: Adaptive neural connection reassignment
Applying ANCRe to Transformers
Numerical experiments
Pre-training of LLMs
Pre-training of diffusion models
Reinforcement learning with ResNets
Ablation study
Conclusion and outlook
Additional related work
...and 9 more sections

Key Result

Theorem 3.2

Consider a 3-layer LNN with residual connection 0:1 and a regression loss. For some sufficiently small initialization, gradient flow (GF) cannot converge faster than a sublinear rate.
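For orientation, the generic rate templates below show what "sublinear" and "linear" convergence usually mean for a loss $L(t)$ along gradient flow with optimal value $L^\ast$. The constants $C, c, p > 0$ are placeholders for illustration and are not taken from the paper.

$$
\text{sublinear (lower bound):}\quad L(t) - L^\ast \;\ge\; \frac{C}{(1+t)^{p}},
\qquad
\text{linear (upper bound):}\quad L(t) - L^\ast \;\le\; C\, e^{-c t}.
$$

Under these templates, reaching accuracy $\varepsilon$ requires time on the order of at least $\varepsilon^{-1/p}$ in the sublinear case but only on the order of $\log(1/\varepsilon)$ in the linear case. This is one common reading of an exponential gap between convergence rates.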

Figures (9)

Figure 1: Visualization of linear neural network (LNN) of $K = 3$ layers.
Figure 2: Convergence comparison of LNN under varying setups.
Figure 2: Perplexity ($\downarrow$) comparison of ANCRe and cascaded residual connections when pre-training LLaMA models of varying sizes. The better of the two is marked with a solid line.
Figure 3: Visualization of ANCRe and two normalization schemes on a 3-layer LNN.
Figure 4: ANCRe applied to the standard Transformers comprising layer normalization (LN), multi-head self-attention (MHSA), and feedforward network (FFN) modules.
...and 4 more figures

Theorems & Definitions (30)

Theorem 3.2: Lower bound
Theorem 3.3: Upper bound
Theorem B.1: Formal restatement of Theorem 3.2
proof
Lemma B.2
proof
Lemma B.3
proof
Lemma B.4
proof
...and 20 more
