Thinking Deeper, Not Longer: Depth-Recurrent Transformers for Compositional Generalization

Hung-Hsuan Chen

Abstract

Standard Transformers have a fixed computational depth, fundamentally limiting their ability to generalize to tasks requiring variable-depth reasoning, such as multi-hop graph traversal or nested logic. We propose a depth-recurrent Transformer that decouples computational depth from parameter count by iteratively applying a shared-weight Transformer block in latent space -- enabling the model to trade recurrence steps for deeper reasoning at inference time. Our architecture incorporates three mechanisms to make deep recurrence (20+ steps) stable: (1) a silent thinking objective that supervises only the final output, forcing genuine multi-step reasoning rather than intermediate heuristic shortcuts; (2) LayerScale initialization to protect fragile reasoning states from untrained layer noise; and (3) an identity-biased recurrence that creates a gradient highway across many steps. We evaluate on three compositional reasoning domains with decreasing inductive biases: graph reachability (strict adjacency masking), nested boolean logic (relative positioning), and unstructured relational text (where sequence position provides no structural hints). Across all tasks, we observe a clear computational frontier -- a boundary where performance transitions from chance to near-perfect as thinking steps scale with task complexity. Moreover, these tasks reveal qualitatively different generalization behaviors: precise but brittle (graph), approximate but robust (logic), and autonomous latent routing without structural hints (text). This progression illuminates how the interplay between a task-invariant recurrent reasoning core and task-specific perceptual interfaces shapes out-of-distribution (OOD) generalization, offering a mechanistic perspective on vertical chain-of-thought that complements the prevailing horizontal token-generation paradigm.
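The recurrence described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the toy `block` (a linear map followed by tanh) stands in for a full Transformer block, and the names `make_block`, `think`, `final_only_loss`, and the scalar `gamma` are assumptions introduced here. The sketch only shows the three ideas in their simplest form: one weight set reused at every step, a small LayerScale-style factor on a residual update so that each step starts close to the identity, and a loss computed on the final state only.

```python
import math
import random


def make_block(dim, seed=0):
    """Create ONE weight matrix; it is reused at every recurrence step."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim)
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]


def block(weights, h):
    """Toy stand-in for a Transformer block (assumption): tanh(W h)."""
    return [math.tanh(sum(w * x for w, x in zip(row, h))) for row in weights]


def think(weights, h, steps, gamma=1e-2):
    """Apply the shared block `steps` times: h <- h + gamma * block(h).

    The residual form keeps each step close to the identity when gamma
    is small, which is the role LayerScale-style scaling plays here.
    """
    for _ in range(steps):
        update = block(weights, h)
        h = [x + gamma * u for x, u in zip(h, update)]
    return h


def final_only_loss(weights, h0, target, steps, gamma=1e-2):
    """Squared error on the LAST state only; intermediate states are
    never compared to a target (the 'silent thinking' idea)."""
    h_final = think(weights, h0, steps, gamma)
    return sum((a - b) ** 2 for a, b in zip(h_final, target))


if __name__ == "__main__":
    W = make_block(dim=4)
    h0 = [0.5, -0.2, 0.1, 0.9]
    # Same parameters, different amounts of computation.
    for steps in (1, 8, 24):
        print(steps, final_only_loss(W, h0, target=[0.0] * 4, steps=steps))
```

The number of parameters in `W` does not depend on `steps`, so depth becomes a quantity chosen when the model is run rather than one fixed by the architecture.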

Paper Structure (31 sections, 8 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Accuracy heatmap for the graph reachability task. A sharp computational frontier exists along the diagonal, confirming the strict 1-hop $=$ 1-step physical constraint enforced by adjacency masking. The model generalizes perfectly to 8 hops (OOD) with sufficient steps, but collapses abruptly to chance (${\sim}50\%$) at 10 hops. Dashed vertical lines mark the step training boundary (5--8 thinking steps); columns outside this range are also OOD, yet accuracy is preserved.
  • Figure 2: Accuracy heatmap for the nested boolean expression evaluation task. Compared to the graph task, the computational frontier here is more gradual. The model exhibits robust out-of-distribution generalization with graceful degradation, maintaining ${>}90\%$ accuracy up to depth 14. Performance also remains highly stable when the model is unrolled for 24 steps, well beyond the training range of 4--16 steps, which supports the stability of the recurrence design.
  • Figure 3: Accuracy heatmap for the relational composition task over unstructured text. Without task-aligned structural inductive biases (no adjacency masking; while RoPE is retained for local word order, the fully shuffled input facts ensure that 1D relative positions carry no meaningful structural signal), the task exhibits a strictly monotonic increase in difficulty. The invariant reasoning core discovers latent pointer-chasing routes, achieving solid OOD generalization at depths 6 and 7 when additional thinking steps are provided. Dashed lines mark the training boundaries in both the depth and step dimensions.
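
The sharp diagonal frontier in Figure 1 follows from the 1-hop = 1-step constraint: if each step can move information along one edge only, a target $k$ hops away cannot be reached in fewer than $k$ steps. The following is an idealized symbolic illustration of that constraint, not the learned model; `reachable_after` and `chain` are hypothetical helper names, and the chain graph is an assumed toy input.

```python
def chain(n):
    """Directed chain 0 -> 1 -> ... -> n (toy graph, assumed for illustration)."""
    return {i: [i + 1] for i in range(n)}


def reachable_after(adjacency, source, steps):
    """Nodes reached from `source` when each step follows at most one edge."""
    frontier = {source}
    for _ in range(steps):
        # One step extends the reached set by exactly one hop.
        frontier = frontier | {v for u in frontier for v in adjacency.get(u, ())}
    return frontier


if __name__ == "__main__":
    graph = chain(10)
    # Rows: hop distance of the target; columns: number of steps.
    for hops in range(1, 11):
        row = ["#" if hops in reachable_after(graph, 0, t) else "." for t in range(1, 11)]
        print(f"{hops:2d} " + "".join(row))
```

The printed grid is filled exactly where the number of steps is at least the hop distance, which is the idealized version of the diagonal boundary in the heatmap. It does not model the collapse at 10 hops reported in the caption, which is a property of the trained network.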