The Markovian Thinker: Architecture-Agnostic Linear Scaling of Reasoning

Milad Aghajohari; Kamran Chitsaz; Amirhossein Kazemnejad; Sarath Chandar; Alessandro Sordoni; Aaron Courville; Siva Reddy

TL;DR

This work tackles the quadratic compute bottleneck in RL-based reasoning for LLMs by decoupling thinking length from context size, a paradigm the authors call Markovian Thinking. It introduces Delethink, a chunked RL environment in which reasoning proceeds in fixed-size chunks with a short carryover, forcing the model to maintain a bounded textual state across chunk boundaries. Empirically, Delethink matches or exceeds LongCoT-RL at similar thinking budgets and continues to improve with test-time scaling, while its compute grows linearly rather than quadratically with thinking length, enabling much longer reasoning traces. The results suggest that off-the-shelf reasoning LLMs often sample Markovian traces even zero-shot, offering a practical path to scalable, efficient reasoning without architectural changes, and they point toward complementary opportunities with linear-attention and state-based RL formulations.
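
The linear-versus-quadratic claim can be illustrated with a simple counting argument. The sketch below is not the paper's cost model: it assumes that cost is proportional to the number of (query, key) token pairs in causal attention, ignores the prompt, the re-encoding of the carryover, and all non-attention compute, and uses hypothetical function names and a hypothetical carryover size.

```python
def longcot_attention_cost(n_think: int) -> int:
    """Token pairs attended when each new token sees all prior thinking tokens.

    Assumption: cost is proportional to the context length at each step,
    so the total is 1 + 2 + ... + n_think (quadratic in n_think).
    """
    return n_think * (n_think + 1) // 2


def chunked_attention_cost(n_think: int, chunk: int, carry: int) -> int:
    """Token pairs attended when the context is reset at every chunk boundary.

    `chunk` is the maximum context of thinking tokens; `carry` is the number
    of tokens kept after a reset (a hypothetical value, not from the source).
    """
    assert 0 <= carry < chunk
    cost, produced, context = 0, 0, 0
    while produced < n_think:
        # Generate until the chunk is full or the thinking budget is spent.
        new = min(chunk - context, n_think - produced)
        # The i-th new token attends to `context + i` tokens (itself included).
        cost += new * context + new * (new + 1) // 2
        produced += new
        # Reset: only the carryover remains in context for the next chunk.
        context = carry
    return cost


if __name__ == "__main__":
    for n in (8_192, 24_576, 98_304):
        full = longcot_attention_cost(n)
        chunked = chunked_attention_cost(n, chunk=8_192, carry=4_096)
        print(f"{n:>6} tokens: full/chunked cost ratio = {full / chunked:.1f}")
```

Under these assumptions, every chunk after the first adds the same fixed cost, so the chunked total grows linearly with thinking length while the full-context total grows quadratically. The ratios printed here are pair counts only and are not comparable to measured GPU time.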

Abstract

Reinforcement learning (RL) has recently become a strong recipe for training reasoning LLMs that produce long chains of thought (LongCoT). Yet the standard RL "thinking environment", where the state is the prompt plus all prior reasoning tokens, makes the state unbounded and forces attention-based policies to pay quadratic compute as thoughts lengthen. We revisit the environment itself. We propose Markovian Thinking, a paradigm in which the policy advances reasoning while conditioning on a constant-size state, decoupling thinking length from context size. As an immediate consequence, this yields linear compute with constant memory. We instantiate this idea with Delethink, an RL environment that structures reasoning into fixed-size chunks. Within each chunk, the model thinks as usual; at the boundary, the environment resets the context and reinitializes the prompt with a short carryover. Through RL, the policy learns to write a textual state near the end of each chunk that is sufficient for seamless continuation of reasoning after the reset. Trained in this environment, an R1-Distill 1.5B model reasons in 8K-token chunks yet thinks up to 24K tokens, matching or surpassing LongCoT-RL trained with a 24K budget. With test-time scaling, Delethink continues to improve where LongCoT plateaus. The effect of linear compute is substantial: we empirically estimate that, at an average thinking length of 96K tokens, LongCoT-RL costs 27 H100-months versus 7 for Delethink. Analysis at RL initialization shows that off-the-shelf reasoning models (1.5B-120B) often sample Markovian traces zero-shot across diverse benchmarks, providing positive samples that make RL effective at scale. Our results show that redesigning the thinking environment is a powerful lever: it enables very long reasoning without quadratic overhead and opens a path toward efficient, scalable reasoning LLMs.
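
The chunk-and-reset procedure described in the abstract can be sketched as a short rollout loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `generate` callback signature, the choice of carrying over the last `carry_size` reasoning tokens verbatim, and the toy counting policy are all hypothetical.

```python
from typing import Callable, List

Token = int
# Hypothetical policy interface: (context, max_new_tokens) -> new tokens.
Generate = Callable[[List[Token], int], List[Token]]


def chunked_rollout(
    prompt: List[Token],
    generate: Generate,
    chunk_size: int,
    carry_size: int,
    max_chunks: int,
    eos: Token,
) -> List[Token]:
    """Generate a long trace while keeping the context bounded.

    The context never exceeds len(prompt) + chunk_size tokens: at each
    boundary it is reset to the prompt plus a short carryover.
    """
    assert 0 < carry_size < chunk_size
    trace: List[Token] = []
    context = list(prompt)
    budget = chunk_size  # the first chunk starts with no carryover
    for _ in range(max_chunks):
        new = generate(context, budget)
        trace.extend(new)
        if not new or new[-1] == eos:
            break
        # Boundary: drop everything except the prompt and the carryover
        # (assumed here to be the last `carry_size` reasoning tokens).
        context = list(prompt) + trace[-carry_size:]
        budget = chunk_size - carry_size
    return trace


def make_counting_policy(target: int, eos: Token) -> Generate:
    """Toy stand-in for an LLM: counts upward from the last context token."""
    def generate(context: List[Token], max_new: int) -> List[Token]:
        out: List[Token] = []
        last = context[-1]
        while len(out) < max_new:
            if last >= target:
                out.append(eos)
                break
            last += 1
            out.append(last)
        return out
    return generate


if __name__ == "__main__":
    policy = make_counting_policy(target=10, eos=-1)
    print(chunked_rollout([0], policy, chunk_size=4, carry_size=2,
                          max_chunks=8, eos=-1))
```

The toy policy needs only its most recent token to continue, so the trace survives every reset intact. A real reasoning model has no such guarantee, which is why the abstract describes training the policy with RL to write a sufficient textual state near the end of each chunk.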
