Boule or Baguette? A Study on Task Topology, Length Generalization, and the Benefit of Reasoning Traces

William L. Tong, Ege Cakar, Cengiz Pehlevan

TL;DR

This work introduces PITA, a novel large-scale dataset of over 23 million statements in propositional logic and their corresponding proofs, and proposes notions of task depth and task breadth, which measure respectively the number of steps required to solve an example from a task and the number of unique examples across a task.

Abstract

Recent years have witnessed meteoric progress in reasoning models: neural networks that generate intermediate reasoning traces (RTs) before producing a final output. Despite the rapid advancement, our understanding of how RTs support reasoning, and of the limits of this paradigm, remains incomplete. To promote greater clarity, we introduce PITA: a novel large-scale dataset of over 23 million statements in propositional logic and their corresponding proofs. As a benchmark for robust reasoning, we focus on length generalization: if a model is trained to determine truth or falsity on statements with proofs up to a fixed length, how well does it generalize to statements requiring longer proofs? We propose notions of (1) task depth and (2) task breadth, which measure respectively (1) the number of steps required to solve an example from a task and (2) the number of unique examples across a task. We vary these quantities across subsets of PITA, and find that RT models generalize well on broad and shallow subsets, while deteriorating on narrow and deep subsets relative to non-RT baselines. To determine whether our results are idiosyncratic to PITA or indicative of general phenomena, we compare our results to a simple synthetic task based on syllogisms. Our resulting theory suggests fundamental scalings that limit how well RT models perform on deep tasks, and highlights their generalization strengths on broad tasks. Our findings overall identify fundamental benefits and limitations inherent in using reasoning traces.

Paper Structure

This paper contains 46 sections, 4 theorems, 50 equations, 11 figures, 2 tables.

Key Result

Proposition 2.1

Under the sample-coordinate representation, suppose that the magnitude of sample coordinates $|c_k^{i,n}|$ is the same across all depth indices $k$, activation indices $i$, and branch indices $n$. If $L = 2$ and the task parameters $B, D \gg 1$, then the following properties hold for the set of samp

Figures (11)

  • Figure 1: The PITA dataset. (a) Statements and proofs are expressed in Lean. Tactics are Lean commands that transform sets of propositions. A proof is an alternating sequence of tactics and proof states, which terminates in a special token indicating whether the statement is true or false. (b) Inputs are formatted as prompt-completion pairs, where losses are computed only on completion tokens. The DP model's completion consists only of the special classification token. The RT model's completion includes the whole proof. (c) Breadth of each split, plotted as the total number of unique examples that can be enumerated for a particular statement size (where size is the number of atoms in that statement). (d) Depth of each split, plotted as the number of unique proof states. Boxplots are constructed from 500 samples from each split, and illustrate the median and quartiles. Outliers are determined from 1.5 times the inter-quartile range. For additional details on our splits and how we measure them, see Appendix \ref{app:details}.
  • Figure 2: Generalization accuracy on PITA splits. Models are trained up to the median proof length of their respective splits, then evaluated on longer examples. The dashed line marks chance-level performance, where chance is calculated as the test accuracy attained through random guessing with probability equal to the proportion of true/false examples in the training distribution. Because the true/false proportion may vary widely between train and test distributions, chance level is sometimes substantially below 50 percent, as in Imply. RT models typically outperform DP models on breadth-dominated splits (Full and Imply), while the reverse is true for depth-dominated splits (Or and PHP). Due to computational constraints, the largest trainable model is 7B for PHP and 32B for all others. Boxplots are constructed from 10 runs, where each model is evaluated on 100 test samples. Box lines illustrate the median and quartiles. Outliers are determined from 1.5 times the inter-quartile range.
  • Figure 3: Transitive inference task. (a) Illustration of the TI task. Symbols are arranged in a series of parallel branches, each consisting of a line of inferences. Breadth is parameterized by the number of branches $B$, while depth is parameterized by the number of symbols in a branch $D$. (b) Generalization accuracy for fixed depth $D = 30$ and varying breadth. The red dashed line indicates the max training length. Generalization accuracy for the DP model decays quickly with breadth, while remaining consistently high for the RT model. Shaded error regions correspond to 95 percent confidence intervals estimated from six seeds. (c) Heatmaps of training accuracy for varying depth, breadth, and model size for the full model described in Section \ref{sec:model}. Scalings are plotted in cyan, and agree closely with the high accuracy contours.
  • Figure 4: Attention weights are uniform. We show empirically that the attention weights become uniform in a Transformer trained on our transitive inference task. (a) In blue, total variation distance (TVD) between a uniform distribution and the attention weights in an RT model, measured across 1000 examples. The x-axis indicates the position of the query token. In orange, the probability assigned to each position by a uniform distribution, plotted for comparison. The TVD is very close to zero across all positions, indicating that the attention weights are close to uniform. (b) Left, an example attention matrix from the RT model. Right, attention matrix with uniform entries, plotted for comparison. (c) Histogram of TVD between a uniform distribution and attention weights in a DP model at position 2 (the output position for a DP model), measured across 1000 examples. In orange, the probability assigned to position 2 by a uniform distribution (which is 0.5). As in (a), the TVD is fairly close to 0. (d) The same as (b), plotted with the DP model.
  • Figure 5: Trained models learn the max margin solution. We plot the kernel density estimate (KDE) over the proportion of positive, estimated sample coordinates per task branch in (left) a direct-prediction model and (right) a reasoning-trace model. The proportion is indexed by the x-axis. The y-axis indexes the readout weight of the corresponding weight vector, from which the sample coordinates are estimated. For positive readout weights, we see that proportions are bimodal around 0 and 1. As the readout weight becomes negative, proportions tend to coalesce in a single range. These observations are consistent with implementing a max margin solution. Given a weight vector $\mathbf{w}_i$ and symbol embedding $\mathbf{x}_j$, the corresponding sample coordinate is estimated as $c_j^i = \mathbf{w}_i \cdot \mathbf{x}_j$.
  • ...and 6 more figures
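
Two quantities in the captions above can be made concrete with a short sketch. This is an illustration only, based on one reading of the captions and not on the authors' code: the function names `chance_accuracy` and `tvd_from_uniform` are hypothetical, and the numbers in the usage note are made up. The first function assumes the chance level in Figure 2 is the expected accuracy of a guesser that answers "true" with probability equal to the training proportion of true examples. The second computes the total variation distance between one row of attention weights and a uniform distribution of the same length, as plotted in Figure 4.

```python
from typing import Sequence


def chance_accuracy(p_true_train: float, p_true_test: float) -> float:
    """Expected test accuracy of a random guesser.

    Assumption: the guesser answers "true" with probability p_true_train
    (the fraction of true statements in training), independently of the input.
    """
    hit_on_true = p_true_train * p_true_test
    hit_on_false = (1.0 - p_true_train) * (1.0 - p_true_test)
    return hit_on_true + hit_on_false


def tvd_from_uniform(weights: Sequence[float]) -> float:
    """Total variation distance between `weights` and the uniform distribution.

    `weights` is assumed to be a probability vector (non-negative, sums to 1),
    for example one row of a softmax attention matrix.
    """
    n = len(weights)
    if n == 0:
        raise ValueError("weights must be non-empty")
    return 0.5 * sum(abs(w - 1.0 / n) for w in weights)
```

With hypothetical proportions, `chance_accuracy(0.9, 0.1)` is about 0.18, which shows how a shift in the true/false balance between train and test can push chance well below 50 percent. For a query at position 2, a uniform row is `[0.5, 0.5]` and its distance from uniform is 0, while a row that puts all weight on one position, `[1.0, 0.0]`, gives 0.5.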

Theorems & Definitions (8)

  • Proposition 2.1
  • proof
  • Proposition 2.2
  • proof
  • Proposition 2.3
  • proof
  • Proposition 2.4
  • proof