Table of Contents
Fetching ...

The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer

Marthe Ballon, Andres Algaba, Vincent Ginis

TL;DR

It is shown that accuracy generally declines as reasoning chains grow across all models and compute settings, even when controlling for difficulty of the questions, and that while o3-mini (h) achieves a marginal accuracy gain over o3-mini (m), it does so by allocating substantially more reasoning tokens across all problems, even the ones that o3-mini (m) can already solve.

Abstract

Large language models have demonstrated remarkable progress in mathematical reasoning, leveraging chain-of-thought and test-time compute scaling. However, many open questions remain regarding the interplay between reasoning token usage and accuracy gains. In particular, when comparing models across generations, it is unclear whether improved performance results from longer reasoning chains or more efficient reasoning. We systematically analyze chain-of-thought length across o1-mini and o3-mini variants on the Omni-MATH benchmark, finding that o3-mini (m) achieves superior accuracy without requiring longer reasoning chains than o1-mini. Moreover, we show that accuracy generally declines as reasoning chains grow across all models and compute settings, even when controlling for difficulty of the questions. This accuracy drop is significantly smaller in more proficient models, suggesting that new generations of reasoning models use test-time compute more effectively. Finally, we highlight that while o3-mini (h) achieves a marginal accuracy gain over o3-mini (m), it does so by allocating substantially more reasoning tokens across all problems, even the ones that o3-mini (m) can already solve. These findings provide new insights into the relationship between model capability and reasoning length, with implications for efficiency, scaling, and evaluation methodologies.

The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer

TL;DR

It is shown that accuracy generally declines as reasoning chains grow across all models and compute settings, even when controlling for difficulty of the questions, and that while o3-mini (h) achieves a marginal accuracy gain over o3-mini (m), it does so by allocating substantially more reasoning tokens across all problems, even the ones that o3-mini (m) can already solve.

Abstract

Large language models have demonstrated remarkable progress in mathematical reasoning, leveraging chain-of-thought and test-time compute scaling. However, many open questions remain regarding the interplay between reasoning token usage and accuracy gains. In particular, when comparing models across generations, it is unclear whether improved performance results from longer reasoning chains or more efficient reasoning. We systematically analyze chain-of-thought length across o1-mini and o3-mini variants on the Omni-MATH benchmark, finding that o3-mini (m) achieves superior accuracy without requiring longer reasoning chains than o1-mini. Moreover, we show that accuracy generally declines as reasoning chains grow across all models and compute settings, even when controlling for difficulty of the questions. This accuracy drop is significantly smaller in more proficient models, suggesting that new generations of reasoning models use test-time compute more effectively. Finally, we highlight that while o3-mini (h) achieves a marginal accuracy gain over o3-mini (m), it does so by allocating substantially more reasoning tokens across all problems, even the ones that o3-mini (m) can already solve. These findings provide new insights into the relationship between model capability and reasoning length, with implications for efficiency, scaling, and evaluation methodologies.

Paper Structure

This paper contains 3 sections, 6 equations, 12 figures.

Table of Contents

  1. Methods
  2. Figures
  3. Tables

Figures (12)

  • Figure 1: Accuracy comparison of OpenAI models gpt-4o, o1-mini, o3-mini (m) and o3-mini (h) on the Omni-MATH benchmark. This figure displays the accuracy of gpt-4o, o1-mini, o3-mini (m) and o3-mini (h) on the Omni-MATH benchmark across disciplines and difficulty tiers. The gpt-4o model fails to attain $50\%$ in any category and consistently lags behind the reasoning models. o1-mini significantly improves accuracy, reaching accuracies of $40$-$60\%$ across all domains, while the o3-models surpass $50\%$ accuracy in all categories. In general, accuracy declines as difficulty increases, with the exception of gpt-4o, which shows accuracy vs. difficulty level imbalance for Tiers $2$, $3$, and $4$.
  • Figure 2: Granular performance and reasoning token usage evaluation across domains and difficulty tiers of gpt-4o, o1-mini, o3-mini (m) and o3-mini (h) on the Omni-MATH benchmark. The heatmaps visualize cross-sectional performance scores on a $0$-$100\%$ scale, represented by the color of the progress bar. The length of the progress bar in each cell represents relative token usage for the test-time scaled models. The extra column is computed by averaging over the rows. The extra row and "average" cell are computed independently to give equal weight to multi-domain questions (see \ref{['sec:Methods']}). This figure shows that models allocate more computational resources to problems that require complex combinatorial reasoning (Geometry, Discrete Mathematics and Number Theory), whereas foundational arithmetic and algebra problems demand relatively fewer resources. On average, token usage scales with difficulty level.
  • Figure 3: Analysis of the reasoning token distribution, evolution of token region accuracy, and consistency between difficulty tiers and token usage for o1-mini, o3-mini (m) and o3-mini (h). The main panels of the figure display the distribution of the reasoning tokens as a stacked histogram, illustrating the proportion of correctly and incorrectly answered questions in the Omni-MATH dataset by o1-mini, o3-mini (m) and o3-mini (h). The secondary $y$-axis depicts the probability that the model answers incorrectly given that the token count has surpassed the bin threshold (see \ref{['sec:Methods']}). The panels below the histogram contain a filled histogram where the color opacity represents the difficulty level of the math questions (cfr. \ref{['fig:methodology-2']}). The figure shows that o1-mini and o3-mini (m) have a similar reasoning token distribution, with o3-mini (m) giving more correct answers for high-token regions. o3-mini (h) has a good ratio of correct vs. incorrect answers, even for very high token counts. The probability of giving an incorrect answer increases with token count for all models. Finally, the relative proportion of tier levels in each bin reveal a clear transition from a region where the majority of the questions come from the lowest tiers to a region where the majority of the questions come from the highest tiers (for bins with a sufficient amount of data points).
  • Figure 4: o3 (mini) thinks harder, not longer. This figure shows that o3-mini (m) does not require longer reasoning chains than o1-mini to achieve better accuracy and that, in general, more proficient models exhibit less accuracy decay as reasoning tokens increase. a, Accuracy per reasoning token, computed by dividing the number of correctly answered questions by the total number of questions in each bin of the histograms in \ref{['fig:main_3']}. Accuracy declines as reasoning token usage increases. Furthermore, we observe that the slope of the lines becomes flatter for higher performing models. These effects are further quantified in the regression analysis (see \ref{['sec:Methods']}). b, The boxplots show the distribution of the reasoning tokens for correctly answered questions. Further investigation in the left panel of \ref{['fig:appendix-3']} confirms that o1-mini and o3-mini (m) have a very similar token distribution. The token distribution of o3-mini (h) is stretched linearly with respect to the one of o3-mini (m) (\ref{['fig:appendix-3']} right). c, Stratifying plot a by difficulty level shows that, within difficulty tiers, accuracy also decreases with higher reasoning token usage. This suggests that the number of reasoning tokens, rather that difficulty level alone, can be used as a signal for the correctness of the model's answer. In \ref{['fig:appendix-2']}, we show this also holds when stratifying across domains.
  • Figure A1: Sample problem from the Omni-MATH dataset. The Omni-MATH dataset consists of $4428$ Olympiad-level math problems together with an exact answer, a written out solution and metadata Domain, Difficulty and Source.
  • ...and 7 more figures