On the Bias of Next-Token Predictors Toward Systematically Inefficient Reasoning: A Shortest-Path Case Study

Riccardo Alberghi, Elizaveta Demyanenko, Luca Biggio, Luca Saglietti

TL;DR

This work presents a controlled study of reasoning in decoder-only transformers using a synthetic layered-DAG shortest-path problem to isolate the effect of reasoning trace structure. It compares efficient dynamic-programming-like traces to longer, backtracking traces and finds that the latter generalize better under the same token budget, driven by higher next-token prediction confidence rather than mere trace length. The results show that structured, incremental traces facilitate learning, while naive or overly verbose traces can hinder generalization; training dynamics and temperature during sampling further modulate performance. The findings suggest that for next-token predictors, the quality and structure of the reasoning signal matter more than global optimality, with implications for designing CoT strategies and training curricula in LLMs. The study highlights the importance of understanding inductive biases to guide reasoning in AI systems and motivates further exploration in more naturalistic tasks and larger models.

Abstract

Recent advances in natural language processing highlight two key factors for improving reasoning in large language models (LLMs): (i) allocating more test-time compute tends to help on harder problems but often introduces redundancy in the reasoning trace, and (ii) compute is most effective when reasoning is systematic and incremental, forming structured chains of thought (CoTs) akin to human problem-solving. To study these factors in isolation, we introduce a controlled setting based on shortest-path tasks in layered graphs. We train decoder-only transformers on question-trace-answer triples using a custom tokenizer, comparing models trained on optimal bottom-up dynamic programming traces with those trained on longer, valid traces involving backtracking. Surprisingly, with the same training-token budget, models trained on inefficient traces generalize better to unseen graphs. This benefit is not due to length alone: injecting arbitrary redundancy into reasoning traces fails to help and can even hurt performance. Instead, we find that generalization correlates with the model's confidence in next-token prediction, suggesting that long, coherent, and locally incremental traces make the training signal easier to optimize.
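To make the contrast between the two trace families concrete, the following is a minimal sketch, not the authors' data generator. It assumes a fully connected layered graph stored as a hypothetical `weights[k][i][j]` array (cost of the edge from node `i` in layer `k` to node `j` in layer `k+1`) and records each reasoning step as a plain tuple. The paper's actual graph construction, trace format, tokenizer, and efficiency parameter are not reproduced here.

```python
def dp_trace(weights):
    """Bottom-up dynamic programming: one step per (layer, node) subproblem."""
    width = len(weights[0])
    cost_to_go = [0] * width  # nodes in the last layer need no further cost
    trace = []
    for k in range(len(weights) - 1, -1, -1):
        new_cost = []
        for i in range(width):
            best = min(weights[k][i][j] + cost_to_go[j] for j in range(width))
            trace.append((k, i, best))  # hypothetical step format
            new_cost.append(best)
        cost_to_go = new_cost
    return min(cost_to_go), trace


def dfs_trace(weights):
    """Depth-first enumeration with backtracking: one step per visited node."""
    width = len(weights[0])
    last_layer = len(weights)
    best = [float("inf")]
    trace = []

    def visit(k, i, cost_so_far):
        trace.append((k, i, cost_so_far))  # hypothetical step format
        if k == last_layer:
            best[0] = min(best[0], cost_so_far)
            return
        for j in range(width):
            visit(k + 1, j, cost_so_far + weights[k][i][j])
        # returning from this call is the backtracking step

    for i in range(width):
        visit(0, i, 0)
    return best[0], trace


# Toy graph (assumed for illustration): 3 layers of 2 nodes each.
toy_weights = [
    [[1, 4], [2, 1]],  # layer 0 -> layer 1
    [[3, 1], [5, 2]],  # layer 1 -> layer 2
]
dp_cost, dp_steps = dp_trace(toy_weights)
dfs_cost, dfs_steps = dfs_trace(toy_weights)
print(dp_cost, len(dp_steps))
print(dfs_cost, len(dfs_steps))
```

Both procedures reach the same shortest-path cost, but the backtracking trace is considerably longer because it revisits shared subproblems, while each of its steps only extends the previous partial path by one edge.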

Paper Structure

This paper contains 27 sections, 2 equations, 10 figures, 1 table.

Figures (10)

  • Figure 3: Learning to find the shortest path. (a) Generalization performance of two models trained on $\sim340$K graphs, one without reasoning traces (dashed) and one with $\eta=+5$ (DP) traces (full), over graphs with depths $3$-$5$-$7$. (b) Progress on intermediate training goals for the $\eta=+5$ (DP) model. (c) Acquisition of the integer addition sub-task during the $\eta=+5$ (DP) training. The plot shows the probability of the model predicting the correct $row+column$ sums at different epochs.
  • Figure 4: Impact of trace efficiency. (a) Comparison of the generalization performance between models trained on efficient $\eta=+5$ (DP), intermediate $\eta=0$, and inefficient $\eta=-5$ (DFS) traces, with a training token budget of $32$M (dashed) and $128$M (full) tokens. (b) Next-token confidence measured on the test set of models trained on efficient $\eta=+5$ (DP), intermediate $\eta=0$, and inefficient $\eta=-5$ (DFS) traces with a training token budget of $128$M. (c) Training losses for 5 different seeds of $\eta=-5$ (DFS), showing sudden jumps at the beginning of the 2nd epoch, and of $\eta=+5$ (DP), where optimization is slower and more continuous.
  • Figure 5: Redundant traces. (a) Comparison of generalization performance between models trained on traces with efficiency $\eta=+5$ (DP), $\eta=+5$ (DR), and $\eta=+5$ (RR) (with sampling temperatures $T=1$ (dashed) and $T=0$ (full)), trained with a $128$M token budget. (b) Regularization effect of sampling temperature on $\eta=+5$ (RR), where the answer accuracy improves and the average CoT length converges to the expected one from training data at higher temperatures.
  • Figure 8: Impact of sampling temperature. (a) The length of the CoTs produced by the $\eta=0$ model (full) initially converges to the expected length of the inefficient traces, $\eta=-5$ (DFS), gradually recovering after many epochs. By sampling at positive temperature (dashed and dotted), the length converges to the expected one for $\eta=0$. (b) While converging to the expected number of reasoning steps for the $\eta=0$ strategy, the $\eta=0$ model also achieves better answer accuracy at non-zero temperatures.
  • Figure 11: Reasoning steps metrics. Comparison of CoT-level metrics between models trained on traces with efficiency $\eta=+5$ (DP) and $\eta=-5$ (DFS). Panels: (a) percentage of subproblem optimal steps, (b) percentage of repeated reasoning steps, (c) percentage of consistent steps, (d) percentage of possible sub-paths, (e) percentage of steps with a skipped subproblem, (f) average number of syntax errors.
  • ...and 5 more figures
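
Several captions above refer to next-token confidence and to sampling at temperatures $T=0$ and $T=1$. The sketch below illustrates these two notions under stated assumptions: confidence is taken to be the average probability a model assigns to the reference token, and $T=0$ is treated as greedy decoding. The exact metric used in the paper is not given in this excerpt, and the function names here are hypothetical.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Pick a token index; temperature 0 is treated as greedy (argmax) decoding."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


def mean_confidence(logit_rows, targets):
    """Assumed metric: average probability assigned to each reference token."""
    probs = [softmax(row)[t] for row, t in zip(logit_rows, targets)]
    return sum(probs) / len(probs)


# Toy logits over a 3-token vocabulary (illustrative values only).
rng = random.Random(0)
greedy_choice = sample_token([0.1, 2.0, -1.0], 0, rng)
sampled_choice = sample_token([0.1, 2.0, -1.0], 1.0, rng)
uniform_confidence = mean_confidence([[0.0, 0.0]], [0])
```

Under this assumed definition, a model that spreads probability evenly over two candidate tokens has confidence 0.5, and a trace whose next step is nearly determined by the previous one pushes the value toward 1.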