Rethinking Thinking Tokens: LLMs as Improvement Operators

Lovish Madaan, Aniket Didolkar, Suchin Gururangan, John Quan, Ruan Silva, Ruslan Salakhutdinov, Manzil Zaheer, Sanjeev Arora, Anirudh Goyal

TL;DR

This work tackles the high cost of long chain-of-thought reasoning by introducing Sequential Refinement (SR) and Parallel-Distill-Refine (PDR), two inference-time operators that keep per-call context short while accumulating evidence across rounds via compact textual workspaces. PDR generates diverse drafts in parallel, distills them into a bounded workspace, and refines conditioned on that workspace, which improves accuracy at fixed latency while keeping total compute controllable. The authors formalize the read-write-compress cycle and token budgets, and they train an 8B model with operator-consistent RL to align training with the PDR inference method. Experiments on AIME 2024/2025 show that SR and especially PDR outperform long CoT at matched sequential budgets, with PDR gains of up to +11% on AIME 2024 and +9% on AIME 2025. Operator-consistent RL provides additional improvements. Together, these results suggest that short-context iteration with compact summaries can substitute for long reasoning traces, and that meta-learning the improvement operator further shifts the accuracy-versus-compute Pareto frontier.

Abstract

Reasoning training incentivizes LLMs to produce long chains of thought (long CoT), which, among other things, allows them to explore solution strategies with self-checking. This results in higher accuracy, but inflates context length, token/compute cost, and answer latency. We ask: Can current models leverage their metacognition to provide other combinations on this Pareto frontier, e.g., better accuracy with lower context length and/or latency? Abstractly, we view the model as an improvement operator on its own "thoughts" with a continuum of possible strategies. We identify an interesting inference family, Parallel-Distill-Refine (PDR), which performs the following: (i) generate diverse drafts in parallel; (ii) distill them into a bounded, textual workspace; and (iii) refine conditioned on this workspace, producing an output that seeds the next round. Importantly, context length (hence compute cost) is controllable via the degree of parallelism, and is no longer conflated with the total number of generated tokens. We report PDR instantiations of current models that give better accuracy than long CoT while incurring lower latency. Setting the degree of parallelism to 1 yields an interesting subcase, Sequential Refinement (SR), which iteratively improves a single candidate answer and provides performance superior to long CoT. The success of such model orchestrations raises the question of whether further training could shift the Pareto frontier. To this end, we train an 8B thinking model with Reinforcement Learning (RL) to make it consistent with PDR as the inference method. On math tasks with verifiable answers, iterative pipelines surpass single-pass baselines at matched sequential budgets, with PDR delivering the largest gains (e.g., +11% on AIME 2024 and +9% on AIME 2025).
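
The three PDR steps described in the abstract can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `pdr`, `generate`, `distill`, `schedule`, and `workspace_limit` are hypothetical, the prompt template is invented, the workspace bound is enforced in characters rather than tokens, and the drafts are produced in a plain loop where a real system would issue the calls concurrently. Refinement is modeled here as the next round's generation conditioned on the distilled workspace.

```python
from typing import Callable, List, Sequence


def pdr(
    problem: str,
    generate: Callable[[str, int], str],      # hypothetical LLM call: (prompt, seed) -> draft
    distill: Callable[[Sequence[str]], str],  # hypothetical summarizer: drafts -> workspace text
    schedule: Sequence[int] = (8, 4, 1),      # drafts per round, e.g. M = (8, 4, 1)
    workspace_limit: int = 200,               # assumed bound on workspace size, in characters
) -> str:
    """Run Parallel-Distill-Refine rounds and return the final draft."""
    workspace = ""
    drafts: List[str] = []
    for m_r in schedule:
        # Each call sees only the problem plus the bounded workspace,
        # so per-call context stays short regardless of the round count.
        prompt = f"Problem: {problem}\nWorkspace: {workspace}"
        # (i) Generate m_r drafts (sequential here; concurrent in practice).
        drafts = [generate(prompt, seed) for seed in range(m_r)]
        # (ii) Distill the drafts into a bounded textual workspace.
        workspace = distill(drafts)[:workspace_limit]
        # (iii) The next iteration refines conditioned on this workspace.
    # Assumes the schedule ends with a single draft; otherwise take the first.
    return drafts[0]


def toy_generate(prompt: str, seed: int) -> str:
    # Stand-in for a model call; reports how much context it was given.
    return f"draft(seed={seed}, prompt_chars={len(prompt)})"


def toy_distill(drafts: Sequence[str]) -> str:
    # Stand-in for distillation; a real scheme might summarize or pick top-k.
    return " | ".join(drafts)


if __name__ == "__main__":
    print(pdr("What is 1 + 1?", toy_generate, toy_distill))
```

Under the same assumptions, passing a schedule such as `(1, 1, 1)` gives the Sequential Refinement subcase, where a single draft is revised over successive short rounds.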

Paper Structure

This paper contains 27 sections, 9 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: AIME 2024: Accuracy vs. sequential budget $B_{\mathrm{seq}}$. We compare Long CoT, SR, and PDR; the dashed curve is an oracle upper bound (Oracle-PDR) that perfectly transmits any correct solutions under the same budgets to the compact workspace.
  • Figure 2: (a) Parallel-Distill-Refine (PDR). In round $r$, the model generates $M_r$ parallel drafts, then distills them into a compact workspace using one of the schemes in (b); the refined state seeds the next round. (b) Distillation schemes used to build the workspace (e.g., global summary, shared top-$k$, per-sample top-$k$, random-$k$). (c) Three inference regimes. Top: long chain-of-thought, a single, long trace. Middle: Sequential Refinement (SR), one draft updated over short rounds. Bottom: PDR, where each round spawns $M_r$ drafts, distills them into a workspace, and refines. The example shows a 3-round configuration $M=(8,4,1)$ (the configuration is a hyperparameter, and any other choice is possible). Across panels, the per-call sequential budget $B_{\mathrm{seq}}$ (latency proxy) is held fixed, while PDR increases total compute $B_{\mathrm{total}}$ via parallelism without increasing per-call context.
  • Figure 3: AIME 2024: Iterative improvement beats single-pass long-CoT at matched sequential budgets. The $x$-axis reports $B_{\mathrm{seq}}$: the thinking tokens consumed along the accepted path of the iterative chain, plus any distilled summary that conditions the next step. Tokens spent on unused parallel proposals are excluded, so $B_{\mathrm{seq}}$ serves as a latency proxy. At comparable $B_{\mathrm{seq}}$, both SR and PDR outperform the single-pass long CoT baseline, with PDR yielding the largest gains by converting additional total compute (via parallelism) into accuracy without increasing per-call context.
  • Figure 4: Token budget comparison: We plot all configurations of the Long CoT, SR, and PDR operators against both the $B_{\text{seq}}$ and $B_{\text{total}}$ token budgets for gemini-2.5-flash. Under $B_{\text{seq}}$, PDR forms the Pareto frontier and gives consistent gains over Long CoT and SR. Under $B_{\text{total}}$, however, SR forms the Pareto frontier because it involves no parallel drafts, so no generations are discarded.
  • Figure 5: AIME 2024: Long CoT, SR, and PDR at a thinking budget of 24576. The $x$-axis reports $B_{\mathrm{seq}}$: the thinking tokens consumed along the accepted path of the iterative chain, plus any distilled summary that conditions the next step. Tokens spent on unused parallel proposals are excluded, so $B_{\mathrm{seq}}$ serves as a latency proxy. At a token-matched budget of $442k$ tokens, SR scores 90.4 with a $B_{\mathrm{seq}}$ of $442k$, whereas PDR scores 90.6 with a $B_{\mathrm{seq}}$ of only $172k$ tokens.
  • ...and 5 more figures
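
The distinction between $B_{\mathrm{seq}}$ and $B_{\mathrm{total}}$ in the captions above can be made concrete with a small accounting sketch. It follows the caption wording (accepted-path tokens plus the distilled summary for $B_{\mathrm{seq}}$; every generated token for $B_{\mathrm{total}}$), but the `Round` record, the `budgets` function, and the token counts are hypothetical, and the paper's formal definitions may differ in detail.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Round:
    draft_tokens: List[int]  # tokens generated by each parallel draft in this round
    accepted: int            # index of the draft on the accepted path (assumed known)
    summary_tokens: int      # tokens in the distilled summary passed to the next round


def budgets(rounds: Sequence[Round]) -> Tuple[int, int]:
    """Return (B_seq, B_total) for a run, under the assumptions stated above."""
    # B_seq: only the accepted draft and the summary count, as a latency proxy.
    b_seq = sum(r.draft_tokens[r.accepted] + r.summary_tokens for r in rounds)
    # B_total: every draft counts, including those that were discarded.
    b_total = sum(sum(r.draft_tokens) + r.summary_tokens for r in rounds)
    return b_seq, b_total


# Illustrative numbers only: a PDR run with M = (8, 4, 1) and an SR run of 3 rounds.
pdr_rounds = [Round([100] * 8, 0, 20), Round([100] * 4, 0, 20), Round([100], 0, 0)]
sr_rounds = [Round([100], 0, 20), Round([100], 0, 20), Round([100], 0, 0)]

if __name__ == "__main__":
    print("PDR:", budgets(pdr_rounds))  # (340, 1340)
    print("SR: ", budgets(sr_rounds))   # (340, 340)
```

In this toy setting both runs have the same sequential budget, but PDR spends roughly four times as many total tokens, while for SR the two budgets coincide because no draft is discarded.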