When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1
R. Thomas McCoy, Shunyu Yao, Dan Friedman, Mathew D. Hardy, Thomas L. Griffiths
TL;DR
The paper investigates whether OpenAI o1, a system optimized for reasoning, avoids the limitations that earlier large language models inherit from their origins in next-word prediction. It uses a teleological framework to analyze performance across tasks and focuses on two kinds of sensitivity, to output probability and to task frequency, measuring both accuracy and the number of thinking tokens used. o1 substantially outperforms earlier models, especially on rare variants of common tasks, but the same qualitative trends remain: it performs better and uses fewer thinking tokens in high-probability settings than in low-probability ones. The results suggest that optimizing a language model for reasoning can mitigate, but might not fully overcome, its probability sensitivity.
Abstract
In "Embers of Autoregression" (McCoy et al., 2023), we showed that several large language models (LLMs) have some important limitations that are attributable to their origins in next-word prediction. Here we investigate whether these issues persist with o1, a new system from OpenAI that differs from previous LLMs in that it is optimized for reasoning. We find that o1 substantially outperforms previous LLMs in many cases, with particularly large improvements on rare variants of common tasks (e.g., forming acronyms from the second letter of each word in a list, rather than the first letter). Despite these quantitative improvements, however, o1 still displays the same qualitative trends that we observed in previous systems. Specifically, o1 -- like previous LLMs -- is sensitive to the probability of examples and tasks, performing better and requiring fewer "thinking tokens" in high-probability settings than in low-probability ones. These results show that optimizing a language model for reasoning can mitigate but might not fully overcome the language model's probability sensitivity.
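To make the acronym example concrete, the following is a minimal sketch of the common and rare variants of the task. The function name `acronym`, its `letter_index` parameter, and the word list are hypothetical and chosen only for illustration; they are not taken from the paper's stimuli or evaluation code.

```python
def acronym(words, letter_index=0):
    """Join the letter at `letter_index` of each word, uppercased.

    letter_index=0 is the common first-letter task;
    letter_index=1 is the rare second-letter variant.
    """
    return "".join(word[letter_index] for word in words).upper()


# Illustrative word list (an assumption, not data from the paper).
words = ["apple", "orange", "melon"]

common_variant = acronym(words, letter_index=0)  # first letters
rare_variant = acronym(words, letter_index=1)    # second letters
print(common_variant, rare_variant)
```

Both variants are equally simple as algorithms; they differ only in how often the task is likely to appear in training text, which is the contrast the abstract draws.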
