On the Bias of Next-Token Predictors Toward Systematically Inefficient Reasoning: A Shortest-Path Case Study
Riccardo Alberghi, Elizaveta Demyanenko, Luca Biggio, Luca Saglietti
TL;DR
This work presents a controlled study of reasoning in decoder-only transformers, using a synthetic layered-DAG shortest-path problem to isolate the effect of reasoning trace structure. It compares efficient dynamic-programming-like traces to longer, backtracking traces and finds that the latter generalize better under the same training-token budget, an advantage associated with higher next-token prediction confidence rather than with trace length alone. The results show that structured, incremental traces facilitate learning, while arbitrarily redundant traces can hinder generalization; training dynamics and sampling temperature further modulate performance. The findings suggest that for next-token predictors, the quality and structure of the reasoning signal matter more than global optimality, with implications for designing CoT strategies and training curricula in LLMs. The study highlights the importance of understanding inductive biases to guide reasoning in AI systems and motivates further exploration in more naturalistic tasks and larger models.
Abstract
Recent advances in natural language processing highlight two key factors for improving reasoning in large language models (LLMs): (i) allocating more test-time compute tends to help on harder problems but often introduces redundancy in the reasoning trace, and (ii) compute is most effective when reasoning is systematic and incremental, forming structured chains of thought (CoTs) akin to human problem-solving. To study these factors in isolation, we introduce a controlled setting based on shortest-path tasks in layered graphs. We train decoder-only transformers on question-trace-answer triples using a custom tokenizer, comparing models trained on optimal bottom-up dynamic programming traces with those trained on longer, valid traces involving backtracking. Surprisingly, with the same training-token budget, models trained on inefficient traces generalize better to unseen graphs. This benefit is not due to length alone: injecting arbitrary redundancy into reasoning traces fails to help and can even hurt performance. Instead, we find that generalization correlates with the model's confidence in next-token prediction, suggesting that long, coherent, and locally incremental traces make the training signal easier to optimize.
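To make the task concrete, the following is a minimal sketch of a bottom-up dynamic program for shortest paths in a layered graph, the kind of efficient procedure that the optimal traces are said to follow. The graph encoding (`layer_sizes`, a per-layer dictionary of weighted edges), the function name `layered_shortest_path`, and the choice to allow any node of the first layer as a start are assumptions made for illustration; they are not the paper's data format, tokenizer, or trace layout.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding (an assumption, not the paper's format):
# edges[k][(u, v)] is the cost of the edge from node u in layer k
# to node v in layer k + 1.
LayerEdges = Dict[Tuple[int, int], float]


def layered_shortest_path(
    layer_sizes: List[int], edges: List[LayerEdges]
) -> Tuple[float, List[int]]:
    """Return (cost, path) of a cheapest path from the first to the last layer.

    The path is given as one node index per layer.
    """
    inf = float("inf")
    # best[k][v]: cheapest known cost of reaching node v in layer k.
    best = [[inf] * n for n in layer_sizes]
    # parent[k][v]: predecessor of v in layer k - 1 on that cheapest path.
    parent = [[-1] * n for n in layer_sizes]
    best[0] = [0.0] * layer_sizes[0]

    # Bottom-up sweep: each layer is finalized before the next one is visited,
    # so every edge is relaxed exactly once and nothing is ever revisited.
    for k, layer in enumerate(edges):
        for (u, v), w in layer.items():
            cand = best[k][u] + w
            if cand < best[k + 1][v]:
                best[k + 1][v] = cand
                parent[k + 1][v] = u

    last = len(layer_sizes) - 1
    end = min(range(layer_sizes[last]), key=lambda v: best[last][v])
    if best[last][end] == inf:
        raise ValueError("no path connects the first layer to the last layer")

    # Follow parent pointers backwards to recover the path.
    path = [end]
    for k in range(last, 0, -1):
        path.append(parent[k][path[-1]])
    path.reverse()
    return best[last][end], path


if __name__ == "__main__":
    sizes = [1, 2, 1]
    layered_edges = [{(0, 0): 1.0, (0, 1): 4.0}, {(0, 0): 5.0, (1, 0): 1.0}]
    print(layered_shortest_path(sizes, layered_edges))
```

In this sketch the sweep touches each edge once, which is what makes the procedure efficient. A backtracking trace, by contrast, would explore and abandon partial paths before settling on the same answer, so it is longer but still valid.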
