Breaking the Capability Ceiling of LLM Post-Training by Reintroducing Markov States

Yurun Yuan; Tengyang Xie

Abstract

Reinforcement learning (RL) has become a standard paradigm for post-training and aligning Large Language Models (LLMs), yet recent evidence suggests it faces a persistent "capability ceiling": unlike classical RL systems that discover novel strategies, RL for LLMs often acts as a mere refiner of patterns already latent in pre-trained weights. In this work, we identify a fundamental structural bottleneck: while classical RL relies on compact, informative Markov states, current LLM post-training formulations are tethered to an ever-expanding history of actions. We revisit a classical principle long central to RL yet absent from LLM post-training: explicit Markov states. Theoretically, we provide rigorous guarantees demonstrating that leveraging estimated Markov states can significantly reduce sample complexity. Empirically, we show that introducing Markov states consistently breaks the performance boundaries of standard RL post-training across a suite of complex logic puzzles. Our findings suggest that moving beyond "history-as-state" modeling in favor of structured Markovian representations is essential for unlocking open-ended discovery and genuinely new reasoning capabilities in Generative AI.

Paper Structure

This paper contains 49 sections, 3 theorems, 30 equations, 11 figures, 9 tables, and 3 algorithms.

Introduction
Preliminaries
Markov Decision Process, Policies, and Value Functions
Reinforcement Learning for Language Models
Reintroducing Markov States to LLM Post-Training
Limits of Current RL for LLMs
A Didactic Example
Markov States in LLM Post-Training
Markov State Estimation
Empirical Evidence
Broader Implications and Applications
Experiments
Tasks and Datasets
Models and Training Pipelines
Implementation Details
...and 34 more sections

Key Result

Proposition 1

For the paradigm of action-sequence-based learning, suppose that the assumptions labeled asm:opt-error and asm:def-eps-stat hold. Then we have

Figures (11)

Figure 1: The Combination Lock problem with horizon $H$. At each state $S_h$, the correct action $A_h^+$ advances the agent to the next state; the incorrect action $A_h^-$ resets it to the starting position.
Figure 2: Comparison of the Markov agent and the action-sequence agent on the Combination Lock task. We focus on two metrics: the success rate in reaching the final goal state and the furthest state reached before the agent triggers an incorrect action.
Figure 3: Comparison of action-sequence-based models and our Markovian Models. For action-sequence models, the new action ($a_h$) is appended to the existing action history and fed back into the model as the input for the subsequent prediction. For Markov models, the next action $a_h$ is combined with the current state $s_h$ and passed through a state transition function $\mathcal{P}$ to obtain the next state $s_{h+1}$, which is the input for the subsequent prediction.
Figure 4: Pass@$k$ accuracy for Qwen3-4B-based models on Futoshiki. While action-sequence models rarely improve SFT Pass@$k$, Markov models consistently surpass their base models' limits.
Figure 5: Training reward curves. Markov models reach higher rewards in fewer training steps, providing empirical evidence of lower sample complexity.
...and 6 more figures
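
The Combination Lock dynamics in Figure 1 and the two input conventions in Figure 3 can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes two actions per state (encoded as 0 and 1), a randomly drawn correct action for each state, and an absorbing goal state; the names CombinationLock, markov_rollout, and action_sequence_rollout are hypothetical and chosen only for this example.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class CombinationLock:
    """Toy Combination Lock with horizon H (assumption: binary actions 0/1)."""
    horizon: int
    seed: int = 0
    correct: List[int] = field(init=False)

    def __post_init__(self) -> None:
        rng = random.Random(self.seed)
        # correct[h] plays the role of A_h^+; the other action plays A_h^-.
        self.correct = [rng.randint(0, 1) for _ in range(self.horizon)]

    def transition(self, state: int, action: int) -> int:
        """Transition function P: advance on the correct action, else reset."""
        if state >= self.horizon:
            return state  # assumption: the goal state is absorbing
        return state + 1 if action == self.correct[state] else 0


def markov_rollout(env: CombinationLock,
                   policy: Callable[[int], int],
                   steps: int) -> Tuple[int, int]:
    """Markov agent: the policy input is only the current state s_h."""
    state, furthest = 0, 0
    for _ in range(steps):
        action = policy(state)
        state = env.transition(state, action)
        furthest = max(furthest, state)
    return state, furthest


def action_sequence_rollout(env: CombinationLock,
                            policy: Callable[[Tuple[int, ...]], int],
                            steps: int) -> Tuple[int, List[int]]:
    """Action-sequence agent: the policy input is the growing action history."""
    history: List[int] = []
    state = 0  # tracked by the environment for scoring, never shown to the policy
    for _ in range(steps):
        action = policy(tuple(history))
        history.append(action)
        state = env.transition(state, action)
    return state, history
```

In this sketch the Markov policy always receives a single integer, while the action-sequence policy receives an input whose length grows by one at every step, which mirrors the contrast drawn in Figure 3.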

Theorems & Definitions (6)

Proposition 1: Performance guarantee of action-sequence-based learning
Proposition 2: Performance guarantee of approximate Markovian learning
Lemma 3: Performance difference lemma
Proof of Proposition 1
Proof of Proposition 2
Proof of Lemma 3
