When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1

R. Thomas McCoy; Shunyu Yao; Dan Friedman; Mathew D. Hardy; Thomas L. Griffiths

TL;DR

The paper investigates whether OpenAI o1, a system optimized for reasoning, avoids the limitations that earlier large language models inherit from next-word prediction. Following the teleological approach of the earlier "Embers of Autoregression" work, it tests sensitivity along two axes, output probability and task frequency, measuring both accuracy and the number of thinking tokens used. o1 shows substantial gains over earlier models, especially on rare task variants, but it still performs better and uses fewer tokens in high-probability settings, so the autoregressive biases are mitigated rather than eliminated. The work suggests that fully overcoming these effects may require combining reasoning optimization with non-probabilistic components, with implications for designing more robust, instruction-following AI systems.

Abstract

In "Embers of Autoregression" (McCoy et al., 2023), we showed that several large language models (LLMs) have some important limitations that are attributable to their origins in next-word prediction. Here we investigate whether these issues persist with o1, a new system from OpenAI that differs from previous LLMs in that it is optimized for reasoning. We find that o1 substantially outperforms previous LLMs in many cases, with particularly large improvements on rare variants of common tasks (e.g., forming acronyms from the second letter of each word in a list, rather than the first letter). Despite these quantitative improvements, however, o1 still displays the same qualitative trends that we observed in previous systems. Specifically, o1 -- like previous LLMs -- is sensitive to the probability of examples and tasks, performing better and requiring fewer "thinking tokens" in high-probability settings than in low-probability ones. These results show that optimizing a language model for reasoning can mitigate but might not fully overcome the language model's probability sensitivity.
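To make the distinction between common and rare task variants concrete, here is a minimal Python sketch of three of the task pairs named on this page: acronyms, sorting, and shift ciphers. The function names, the example words, and the shift values 13 and 2 are illustrative assumptions, not the paper's evaluation code or data. Each pair has the same algorithmic difficulty; the variants differ only in how often they appear in text.

```python
def acronym(words, letter_index=0):
    """Join the letter at `letter_index` of each word, uppercased.

    letter_index=0 is the common variant (first letters);
    letter_index=1 is the rare variant (second letters).
    """
    return "".join(word[letter_index] for word in words).upper()


def shift_decode(text, shift):
    """Decode a shift cipher by moving each ASCII letter back `shift` places."""
    decoded = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            decoded.append(chr((ord(ch) - base - shift) % 26 + base))
        else:
            decoded.append(ch)  # leave spaces and punctuation unchanged
    return "".join(decoded)


# Hypothetical example inputs, chosen only for illustration.
words = ["apple", "pear", "plum"]

common_acronym = acronym(words, letter_index=0)  # first letters
rare_acronym = acronym(words, letter_index=1)    # second letters

common_sort = sorted(words)                      # alphabetical order
rare_sort = sorted(words, reverse=True)          # reverse alphabetical order

# Assumed shift values: 13 stands in for a frequently seen cipher,
# 2 for a rarely seen one.
common_cipher = shift_decode("uryyb", 13)
rare_cipher = shift_decode("jgnnq", 2)
```

A deterministic program like this treats both variants of each pair identically, which is the sense in which a gap in model accuracy between them points to sensitivity to task frequency rather than to task difficulty.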

Paper Structure (6 sections, 4 figures)

Introduction
Background: o1
Results
Output probability
Task frequency
Conclusion

Figures (4)

Figure 1: Across the four tasks we considered (shift ciphers, Pig Latin, article swapping, and reversal), all six LLMs evaluated here, including o1, show sensitivity to output probability, with higher accuracies on examples that have a high output probability than on examples that have a low output probability. The results for all models except o1 are from McCoy et al. (2023). The intervals around the lines show one standard error.
Figure 2: o1 tends to use more tokens when processing examples that have low-probability answers than examples that have high-probability answers. The plots show the median number of tokens that o1 used for each group of examples.
Figure 3: Left: We evaluated LLMs on two variants of five tasks: a variant that is common in Internet text (e.g., forming acronyms from the first letter of each word in a sequence) and a variant that is rare (e.g., forming acronyms from the second letter of each word in a sequence). On these datasets, the five LLMs other than o1 showed much higher accuracy on the common variants than the rare ones, but o1 showed similar performance on common and rare variants. The results for models other than o1 are from McCoy et al. (2023). Top right: On datasets based on challenging sorting tasks, o1 performs better on the common type of sorting (i.e., sorting into alphabetical order) than on the rare type of sorting (i.e., sorting into reverse alphabetical order). Bottom right: When decoding shift ciphers, o1 shows roughly the same performance on the common cipher type and on the rare cipher type when the examples are ones with a high output probability. However, when it is instead evaluated on examples with medium or low probability, its accuracy is higher for the common cipher type than the rare one. The error intervals in all plots show one standard error.
Figure 4: In some cases, namely for shift ciphers and acronyms, o1 consumes more tokens when performing a rare task variant than a common task variant. For the other task pairs, the number of tokens it consumes is similar across both task frequency levels. The bars show the median number of tokens used within each group of examples. Note that the vertical axes have different scales in each plot.
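
The captions above report two summary statistics per group of examples: accuracy with an interval of one standard error, and the median number of tokens used. The sketch below shows one way such summaries could be computed. It assumes the standard error of a binomial proportion, sqrt(p(1 - p)/n); the captions do not say which estimator the authors used, and the function names and sample values are hypothetical.

```python
import math
import statistics


def accuracy_with_se(correct_flags):
    """Return (accuracy, standard error) for a list of True/False outcomes.

    Assumption: the standard error is that of a binomial proportion.
    """
    n = len(correct_flags)
    p = sum(correct_flags) / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se


def median_tokens(token_counts):
    """Return the median token count for one group of examples."""
    return statistics.median(token_counts)


# Made-up values for illustration only; these are not results from the paper.
acc, se = accuracy_with_se([True, True, False, False])
med = median_tokens([100, 300, 200])
```

The median is less affected than the mean by a few examples with very long outputs, which may be one reason to prefer it when summarizing token counts.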
