A Theory for Length Generalization in Learning to Reason

Changnan Xiao; Bing Liu

TL;DR

This work addresses length generalization (LG) in reasoning tasks by modeling the reasoning process as a directed acyclic graph (DAG). It introduces the maximal input distance $R$ and the stronger notion of $(n,r)$-consistency to characterize when LG is achievable: LG is possible if $R<\infty$ or, in the harder $R=\infty$ case, if the problem is $(n,r)$-consistent. The theory is validated experimentally with a vanilla Transformer on parity, addition, and multiplication tasks; the experiments demonstrate LG under the proposed conditions and show how the chain-of-thought (CoT) formulation of a problem influences its learnability. The results provide a theoretical foundation connecting CoT-based reasoning with generalization across problem lengths, with implications for designing representations and training regimes that achieve robust long-horizon reasoning.

Abstract

Length generalization (LG) is a challenging problem in learning to reason. It refers to the phenomenon that a model trained on reasoning problems of smaller lengths or sizes struggles with problems of larger lengths or sizes. Although LG has been studied by many researchers, the challenge remains. This paper proposes a theoretical study of LG for problems whose reasoning processes can be modeled as directed acyclic graphs (DAGs). The paper first identifies and proves the conditions under which LG can be achieved in learning to reason. It then designs problem representations based on the theory and uses a Transformer to learn challenging reasoning problems such as parity, addition, and multiplication with perfect LG.
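To make the role of the maximal input distance $R$ more concrete, the following minimal sketch compares two toy sequence layouts for a running-parity computation. It assumes that $R$ can be read as the largest positional gap between the inputs of any single reasoning step; the paper's formal definition may differ. The layouts and the helper names `max_input_gap`, `appended_layout`, and `interleaved_layout` are hypothetical and invented for illustration; they are not the paper's CoT formats.

```python
def max_input_gap(step_inputs):
    """Largest positional gap between the inputs of any single step.

    step_inputs is a list of tuples; each tuple holds the sequence
    positions that one reasoning step reads from.
    """
    return max(max(pos) - min(pos) for pos in step_inputs)


def appended_layout(n):
    """Toy layout (assumption): bits b_1..b_n first, then parities p_1..p_n.

    Step i (i >= 2) reads b_i at index i-1 and p_{i-1} at index n+i-2.
    """
    return [(i - 1, n + i - 2) for i in range(2, n + 1)]


def interleaved_layout(n):
    """Toy layout (assumption): the sequence b_1 p_1 b_2 p_2 ...

    Step i (i >= 2) reads p_{i-1} at index 2i-3 and b_i at index 2i-2.
    """
    return [(2 * i - 3, 2 * i - 2) for i in range(2, n + 1)]


# The gap grows with n in the appended layout and stays fixed when interleaved.
for n in (4, 8, 16):
    print(n, max_input_gap(appended_layout(n)), max_input_gap(interleaved_layout(n)))
```

Under this reading, the appended layout has a gap that grows with the problem length, so no finite bound exists across all lengths, whereas the interleaved layout keeps the gap constant. This is only meant to illustrate why the choice of representation can matter for the $R<\infty$ condition.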

Paper Structure (17 sections, 10 theorems, 33 equations, 4 figures, 3 tables)


Introduction
Overview of the Proposed LG Theory
The Proposed LG Theory
Given the Directed Acyclic Graph (DAG)
Given Only the Unstructured Sequence
Dealing with $R = \infty$
Related Work
Experiments
Experimental Problems
Results and Analysis
Conclusion
Proofs - Causal Function
Proofs - Recursive Formula
Proof - Maximal Input Element Distance of a Reasoning Step
Proof - $(n, r)$-Consistency
...and 2 more sections

Key Result

Theorem 3.1

For $|X| < \infty$ and $\sup |p(v)| < \infty$, i.e., $|\mathbf{X}| < \infty$, if $D = \mathbf{X}$, then there exists an approximation function $\hat{f}: X^{\sup |p(v)|} \rightarrow X$, s.t. $\hat{f}(p(v)) = f(p(v)),\,\forall\, p(v) \in \mathbf{X}$.
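One way to build intuition for Theorem 3.1 is a lookup table: if the element set is finite, every step has a bounded number of inputs, and the training set covers every possible input combination, then memorizing the training set already reproduces the target function exactly. The sketch below is an illustration under stated assumptions, not the paper's construction. It assumes $X = \{0, 1\}$, two inputs per step, and a hypothetical target function `target_step` (XOR of a running parity and the next bit); the names `D`, `f_hat`, and `solve_parity` are likewise chosen only for this example.

```python
from itertools import product


def target_step(inputs):
    """Hypothetical target function f: XOR of a running parity and the next bit."""
    prev_parity, bit = inputs
    return prev_parity ^ bit


# Finite element set X; every step has exactly two inputs, so sup |p(v)| = 2.
X = (0, 1)

# Training set D covers every possible input combination (D equals the full input space).
D = {inputs: target_step(inputs) for inputs in product(X, repeat=2)}


def f_hat(inputs):
    """Approximation function: a lookup table memorized from D."""
    return D[tuple(inputs)]


def solve_parity(bits):
    """Apply f_hat step by step along the chain of reasoning steps."""
    parity = 0
    for bit in bits:
        parity = f_hat((parity, bit))
    return parity
```

Because `f_hat` agrees with `target_step` on all four input pairs, `solve_parity` is correct for bit strings of any length, even though the table was built from single steps only. The sketch assumes the step structure is given; it says nothing about how a Transformer would learn that structure from an unstructured sequence.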

Figures (4)

Figure 1: An example DAG.
Figure 2: An example of notations.
Figure 3: Test results in accuracy.
Figure 4: Two examples of the ko problem.

Theorems & Definitions (21)

Theorem 3.1
Corollary 3.1.1
Corollary 3.1.2
Theorem 3.2
Theorem 3.3
Definition 3.1
Theorem 3.4
Theorem 3.5
Theorem 3.6
Theorem 3.7
...and 11 more
