What Makes Looped Transformers Perform Better Than Non-Recursive Ones (Provably)

Zixuan Gong, Jiaye Teng, Yong Liu

TL;DR

The paper addresses why looped transformers better handle complex reasoning by analyzing two levels of training dynamics and introducing a refined loss-landscape model that distinguishes River-U-Valley and River-V-Valley geometries. It provides theoretical results showing that the River-V-Valley landscape induces a more effective optimization path for Looped-Attn, including a two-phase learning dynamic and enhanced generalization to longer sequences. Building on this, the authors propose SHIFT, a two-stage training framework that preserves the efficiency of Single-Attn while achieving Looped-Attn-level performance through staged depth and a principled switch criterion. The work offers a principled perspective on inductive bias from recursion, connects landscape geometry to practical training strategies, and suggests routes for efficient deployment of recursive architectures in large-scale models.

Abstract

While looped transformers (termed Looped-Attn) often outperform standard transformers (termed Single-Attn) on complex reasoning tasks, the theoretical basis for this advantage remains underexplored. In this paper, we explain this phenomenon through the lens of loss landscape geometry, inspired by empirical observations of their distinct dynamics at both the sample and Hessian levels. To formalize this, we extend the River-Valley landscape model by distinguishing between U-shaped valleys (flat) and V-shaped valleys (steep). Based on empirical observations, we conjecture that the recursive architecture of Looped-Attn induces a landscape-level inductive bias towards River-V-Valley. Theoretical derivations based on this inductive bias guarantee better loss convergence along the river due to valley hopping, and further encourage the learning of complex patterns compared to the River-U-Valley induced by Single-Attn. Building on this insight, we propose SHIFT (Staged HIerarchical Framework for Progressive Training), a staged training framework that accelerates the training of Looped-Attn while achieving comparable performance.
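To make the idea of a staged schedule concrete, here is a minimal sketch of switching from a single pass to several loop iterations during training. The summary does not state the SHIFT switch criterion, so the plateau test `should_switch`, the function `staged_schedule`, and the values of `window`, `tol`, and `looped_depth` are all hypothetical stand-ins, and the loss values are made up for illustration.

```python
def should_switch(losses, window=3, tol=0.05):
    """Hypothetical plateau test (an assumption, not the paper's criterion).

    Returns True once the mean loss over the last `window` steps improves
    on the mean of the previous `window` steps by less than `tol`.
    """
    if len(losses) < 2 * window:
        return False
    recent = sum(losses[-window:]) / window
    previous = sum(losses[-2 * window:-window]) / window
    return previous - recent < tol


def staged_schedule(losses, looped_depth=4, window=3, tol=0.05):
    """Return the loop count used at each step.

    Steps use 1 iteration (Single-Attn-like) until the plateau test fires
    on the losses seen so far, then `looped_depth` iterations afterwards.
    """
    depths = []
    switched = False
    for step in range(len(losses)):
        # Only losses from earlier steps are visible when choosing the depth.
        if not switched and should_switch(losses[:step], window, tol):
            switched = True
        depths.append(looped_depth if switched else 1)
    return depths


# Illustrative (made-up) loss curve that flattens out over time.
losses = [1.0, 0.8, 0.6, 0.5, 0.45, 0.42, 0.415, 0.412, 0.411, 0.41]
depths = staged_schedule(losses)
print(depths)
```

The schedule switches only once and never switches back, which mirrors the two-stage description; any real criterion would replace `should_switch`.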

Paper Structure

This paper contains 35 sections, 8 theorems, 164 equations, 25 figures, 1 table.

Key Result

Theorem 1

Under the quadratic-loss setting, define $\mathcal{C}$ as the upper bound on the cumulative force generated by the valley dynamics on the river subspace. Then it holds that [displayed formula missing from this extract], where $\Phi = I - \eta H_{\text{Valley}}$ for learning rate $\eta$, and $\{\lambda_i\}$ is the spectrum of the valley Hessian $H_{\text{Valley}}$.
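The role of $\Phi = I - \eta H_{\text{Valley}}$ can be illustrated with plain gradient descent on a quadratic loss, where each step multiplies the iterate by $\Phi$. The sketch below is not the theorem's bound; it assumes a diagonal Hessian, and the spectrum `lams`, the learning rate `eta`, and the starting point `x0` are hypothetical values chosen for illustration.

```python
def gd_quadratic(lams, x0, eta, steps):
    """Gradient descent on L(x) = 0.5 * sum(lam_i * x_i**2).

    With a diagonal Hessian H = diag(lams), the update x <- x - eta * H x
    acts per coordinate as x_i <- (1 - eta * lam_i) * x_i, which is
    multiplication by Phi = I - eta * H.
    """
    x = list(x0)
    trajectory = [tuple(x)]
    for _ in range(steps):
        grad = [lam * xi for lam, xi in zip(lams, x)]
        x = [xi - eta * g for xi, g in zip(x, grad)]
        trajectory.append(tuple(x))
    return trajectory


def closed_form(lams, x0, eta, k):
    """Iterate after k steps, computed directly as Phi**k applied to x0."""
    return [((1.0 - eta * lam) ** k) * xi for lam, xi in zip(lams, x0)]


lams = [0.1, 1.5]   # hypothetical spectrum: one flat and one steep direction
eta = 1.0           # hypothetical learning rate
x0 = [1.0, 1.0]
traj = gd_quadratic(lams, x0, eta, steps=5)

for k, point in enumerate(traj):
    print(k, point)
```

Along the flat direction the factor $1 - \eta\lambda_i$ is positive and the coordinate decays monotonically. Along the steep direction the factor is negative, so the coordinate changes sign at every step while shrinking, a loose analogue of hopping across a valley.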

Figures (25)

  • Figure 1: Loss Landscapes, Optimization Trajectories and SHIFT Strategy.
  • Figure 2: Generation of Markov Language Sequences.
  • Figure 3: Data Distribution, Task-Level Performance and Hessian-Level Dynamic. (a) Long-tail distribution of the dataset shown by Information Content. (b) Training accuracy on low information, high information and total sequences. (c) Matrix entropy metric. (d) Mutual information metric.
  • Figure 4: SHIFT Efficiency and Performance on Markov Dataset.
  • Figure 5: Length Generalization.
  • ...and 20 more figures

Theorems & Definitions (24)

  • Definition 1: River-Valley Loss Landscape
  • Conjecture 1: Single-Attn: Flat Valley Trapping
  • Conjecture 2: Looped-Attn: From Steep Valley Hopping to River Convergence
  • Theorem 1: Cumulative Force under Quadratic Loss
  • Corollary 1: Greater Cumulative Force of Looped-Attn
  • Corollary 2: Superior Optimization Performance of Looped-Attn
  • Theorem 2: Superior Optimization Performance of Looped-Attn under General Loss
  • Theorem 3: Shared River Upstream
  • Definition 2: Block-Structured Hessian
  • proof
  • ...and 14 more