The Markovian Thinker: Architecture-Agnostic Linear Scaling of Reasoning

Milad Aghajohari, Kamran Chitsaz, Amirhossein Kazemnejad, Sarath Chandar, Alessandro Sordoni, Aaron Courville, Siva Reddy

TL;DR

This work tackles the quadratic compute bottleneck in RL-based reasoning for LLMs by decoupling thinking length from context through Markovian Thinking. It introduces Delethink, a chunked RL environment where reasoning proceeds in fixed-size chunks with a short carryover, forcing the model to maintain a bounded textual state across boundaries. Empirically, Delethink matches or exceeds LongCoT-RL performance at similar budgets and continues to improve with test-time scaling, achieving substantial compute savings (linear in thinking length) and enabling much longer effective reasoning traces. The results suggest that large reasoning LLMs exhibit Markovian traces even zero-shot, offering a practical path to scalable, efficient reasoning without architectural changes, and point toward complementary opportunities with linear-attention and state-based RL formulations.

Abstract

Reinforcement learning (RL) has recently become a strong recipe for training reasoning LLMs that produce long chains of thought (LongCoT). Yet the standard RL "thinking environment", where the state is the prompt plus all prior reasoning tokens, makes the state unbounded and forces attention-based policies to pay quadratic compute as thoughts lengthen. We revisit the environment itself. We propose Markovian Thinking, a paradigm in which the policy advances reasoning while conditioning on a constant-size state, decoupling thinking length from context size. As an immediate consequence, this yields linear compute with constant memory. We instantiate this idea with Delethink, an RL environment that structures reasoning into fixed-size chunks. Within each chunk, the model thinks as usual; at the boundary, the environment resets the context and reinitializes the prompt with a short carryover. Through RL, the policy learns to write a textual state near the end of each chunk sufficient for seamless continuation of reasoning after reset. Trained in this environment, an R1-Distill 1.5B model reasons in 8K-token chunks yet thinks up to 24K tokens, matching or surpassing LongCoT-RL trained with a 24K budget. With test-time scaling, Delethink continues to improve where LongCoT plateaus. The effect of linear compute is substantial: we empirically estimate that, at an average thinking length of 96K tokens, LongCoT-RL costs 27 H100-months versus 7 for Delethink. Analysis at RL initialization shows off-the-shelf reasoning models (1.5B-120B) often sample Markovian traces zero-shot across diverse benchmarks, providing positive samples that make RL effective at scale. Our results show that redesigning the thinking environment is a powerful lever: it enables very long reasoning without quadratic overhead and opens a path toward efficient, scalable reasoning LLMs.
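The chunked environment described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`delethink_rollout`, `make_counter_generator`), the token-list interface, and the choice to count the carryover against the chunk budget are all assumptions made for illustration. A toy generator that simply counts upward stands in for the policy, so the example runs without a model.

```python
from typing import Callable, List, Tuple

Token = str
EOS = "<eos>"


def delethink_rollout(
    query: List[Token],
    generate: Callable[[List[Token], int], List[Token]],
    chunk_size: int,
    carryover: int,
    max_chunks: int,
) -> Tuple[List[Token], int]:
    """Hypothetical chunked rollout: reset the context at every chunk boundary.

    Returns the full reasoning trace and the largest context the policy saw.
    Assumes 0 < carryover < chunk_size (an illustrative constraint).
    """
    assert 0 < carryover < chunk_size
    trace: List[Token] = []
    carry: List[Token] = []
    peak_context = 0
    for _ in range(max_chunks):
        # Fresh prompt: the query plus a short tail of the previous chunk.
        prompt = query + carry
        # Assumption: carried tokens count against the per-chunk budget.
        chunk = generate(prompt, chunk_size - len(carry))
        trace.extend(chunk)
        peak_context = max(peak_context, len(prompt) + len(chunk))
        if chunk and chunk[-1] == EOS:
            break
        carry = chunk[-carryover:]
    return trace, peak_context


def make_counter_generator(total: int) -> Callable[[List[Token], int], List[Token]]:
    """Toy stand-in for a policy: continues counting t0, t1, ... up to `total`."""

    def generate(prompt: List[Token], budget: int) -> List[Token]:
        last = -1
        # The only "state" available is whatever text survived in the prompt.
        for tok in reversed(prompt):
            if tok.startswith("t") and tok[1:].isdigit():
                last = int(tok[1:])
                break
        out: List[Token] = []
        i = last + 1
        while len(out) < budget:
            if i >= total:
                out.append(EOS)
                break
            out.append(f"t{i}")
            i += 1
        return out

    return generate


trace, peak = delethink_rollout(
    query=["q"], generate=make_counter_generator(20),
    chunk_size=8, carryover=2, max_chunks=10,
)
print(len(trace), peak)
```

In this toy run the trace grows well past the chunk size while the context seen by the generator never exceeds the query length plus `chunk_size`, which is the decoupling of thinking length from context size that the abstract describes. The toy policy can continue only because its progress is recoverable from the carried-over tokens; in the paper, writing such a textual state is what the policy learns through RL.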

Paper Structure

This paper contains 58 sections, 26 equations, 23 figures, 2 tables, 1 algorithm.

Figures (23)

  • Figure 1: Delethink redefines the thinking RL environment as a chunked, Markovian process: generation proceeds in fixed-size chunks, and at each boundary the environment resets the context to a fresh prompt containing the query plus a short carryover from the previous chunk. This forces the policy to learn to progress across chunks by maintaining a textual state, creating a Markovian Thinker. In contrast, the LongCoT environment concatenates tokens indefinitely, so the state (and model context) grows with the trace.
  • Figure 2: (a) Delethink 24K (a Markovian Thinker) matches and surpasses LongCoT-RL 24K in accuracy during RL training while using less compute; both methods improve as the thinking budget scales from 8K to 24K. (b) Beyond the trained thinking budget, Delethink significantly outperforms and keeps improving while others plateau; within the budget, Delethink and LongCoT-RL 24K scale similarly with test-time compute (reported using sequential sampling). (c) Training cost of R1-Distill 1.5B vs. average thinking length with an optimized stack of verl [sheng2024verl] + SGLang [zheng2023sglang] on H100s: quadratic for LongCoT and linear for Delethink, as predicted.
  • Figure 3: Computational profiles of LongCoT-RL and Delethink scaling from $n$ to $nS$ tokens.
  • Figure 4: (Left) On IID math tasks (AIME’24/’25, HMMT’25), Delethink outperforms LongCoT-RL 24K. Shaded regions show gains from test-time scaling (through sequential sampling), where Delethink improves the performance even more; on OOD tasks (GPQA-Diamond, LiveCodeBench) gains are modest, yet Delethink still matches or slightly beats LongCoT-RL 24K. (Right) Per-GPU rollout throughput during RL training (R1-Distill 1.5B, H100 cluster). Delethink’s RL environment design keeps peak memory constant, sustaining throughput as thinking scales; LongCoT’s memory grows linearly, driving throughput down at longer budgets.
  • Figure 5: (Left) Smoothed entropy over RL steps for Delethink and LongCoT-RL. Both remain roughly flat and non-collapsing [cui2025entropy], indicating stable learning. Note that rising entropy typically precedes divergence. (Right) Delethink and LongCoT use their thinking budgets well. At longer lengths, Delethink produces more correct answers, showing it spends its budget effectively.
  • ...and 18 more figures
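
The quadratic-versus-linear contrast in the Figure 2(c) and Figure 3 captions can be checked with a deliberately simplified cost model. The sketch below counts only causal attention (query, key) pairs, ignores the prompt, the carryover, and all non-attention compute, and treats each chunk as generating `n` fresh tokens; the function names are hypothetical, and the real profiles in the paper account for more than this.

```python
def attention_pairs(context_len: int) -> int:
    """(query, key) pairs when generating `context_len` tokens causally.

    Token t attends to t keys (itself and everything before it),
    so the total is 1 + 2 + ... + context_len.
    """
    return context_len * (context_len + 1) // 2


def longcot_pairs(n: int, s: int) -> int:
    """One growing context of n * s tokens: cost grows as (n * s) ** 2."""
    return attention_pairs(n * s)


def chunked_pairs(n: int, s: int) -> int:
    """s independent chunks of n tokens each: cost grows as s * n ** 2."""
    return s * attention_pairs(n)


n = 8_192  # tokens per chunk (illustrative value)
for s in (1, 2, 4, 8):
    ratio = longcot_pairs(n, s) / chunked_pairs(n, s)
    print(f"S={s}: LongCoT / chunked attention cost ~ {ratio:.1f}x")
```

Under this model, doubling `s` doubles the chunked cost but roughly quadruples the single-context cost, and the peak number of cached keys stays at `n` for the chunked case versus `n * s` for the growing context. These ratios are properties of the toy model only and should not be read as the measured H100-month figures reported in the abstract.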