Sparse Growing Transformer: Training-Time Sparse Depth Allocation via Progressive Attention Looping

Yao Chen, Yilong Chen, Yinqi Yang, Junyuan Shang, Zhenyu Zhang, Zefeng Zhang, Shuaiyi Nie, Shuohuan Wang, Yu Sun, Hua Wu, HaiFeng Wang, Tingwen Liu

Abstract

Existing approaches to increasing the effective depth of Transformers predominantly rely on parameter reuse, extending computation through recursive execution. Under this paradigm, the network structure remains static along the training timeline, and additional computational depth is uniformly assigned to entire blocks at the parameter level. This rigidity across training time and parameter space leads to substantial computational redundancy during training. In contrast, we argue that depth allocation during training should not be a static preset, but rather a progressively growing structural process. Our systematic analysis reveals a deep-to-shallow maturation trajectory across layers, where high-entropy attention heads play a crucial role in semantic integration. Motivated by this observation, we introduce the Sparse Growing Transformer (SGT). SGT is a training-time sparse depth allocation framework that progressively extends recurrence from deeper to shallower layers via targeted attention looping on informative heads. This mechanism induces structural sparsity by selectively increasing depth only for a small subset of parameters as training evolves. Extensive experiments across multiple parameter scales demonstrate that SGT consistently outperforms training-time static block-level looping baselines under comparable settings, while reducing the additional training FLOPs overhead from approximately 16–20% to only 1–3% relative to a standard Transformer backbone.

Paper Structure

This paper contains 55 sections, 1 theorem, 24 equations, 15 figures, 10 tables, 1 algorithm.

Key Result

Lemma 1

For the $i$-th row of a causal attention matrix $\bar{A}$ with entropy $\mathcal{E}_i = -\sum_{j=1}^{i} \bar{A}_{ij} \log \bar{A}_{ij}$, the diagonal element satisfies:
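
The entropy $\mathcal{E}_i$ in the lemma statement can be computed directly from an attention matrix. The following is a minimal NumPy sketch of that quantity only, assuming $\bar{A}$ is a row-stochastic, lower-triangular (causal) matrix. The function names `causal_softmax` and `row_entropy` are hypothetical helpers for illustration and are not taken from the paper.

```python
import numpy as np

def causal_softmax(scores):
    """Row-wise softmax restricted to positions j <= i (causal mask)."""
    n = scores.shape[0]
    mask = np.tril(np.ones((n, n), dtype=bool))
    masked = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability; the diagonal is
    # always visible, so every row has at least one finite entry.
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)  # masked positions become exactly 0
    return weights / weights.sum(axis=1, keepdims=True)

def row_entropy(attn):
    """E_i = -sum_{j<=i} A_ij log A_ij, treating 0 * log 0 as 0."""
    safe = np.where(attn > 0, attn, 1.0)  # log(1) = 0 for zero entries
    return -(attn * np.log(safe)).sum(axis=1)

# Illustrative input only: random scores, not data from the paper.
rng = np.random.default_rng(0)
attn = causal_softmax(rng.normal(size=(5, 5)))
entropy = row_entropy(attn)
```

In the lemma's 1-indexed notation, the first row attends only to itself and so has entropy 0, and row $i$ has entropy at most $\log i$, reached when attention is uniform over the $i$ visible positions.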

Figures (15)

  • Figure 1: Visualization of attention heatmaps for low- and high-entropy heads from Qwen3-0.6B. Along the horizontal axis, tokens highlighted in red denote the subset that receives the top 50% of the attention from the final query position (details in Appendix \ref{app: Qualitative Analysis}).
  • Figure 2: Evolution of intra-layer variance in head-wise attention entropy across training steps for models pretrained from scratch (we visualize four representative layers; see Figure \ref{fig: training/head_variance_unified} for detailed results).
  • Figure 3: The Sparse Growing Transformer (SGT) architecture. It implements Training-Phase Structural Sparsity via a deep-to-shallow progressive growth schedule (Left) and selective high-entropy head looping (Right).
  • Figure 4: Comparative convergence trajectories of training perplexity (PPL) versus cumulative training FLOPs for the 573M model scale.
  • Figure 5: Attention entropy dynamics of two heads in Layer 6 during training, including the warm-up phase and layer selection phase, for models trained with D2S and S2D strategies (more visualizations in Figure \ref{fig: direction_stable}).
  • ...and 10 more figures

Theorems & Definitions (2)

  • Lemma 1: Entropy-Based Bound on Diagonal Elements
  • Proof: Proof Sketch