Breaking the Capability Ceiling of LLM Post-Training by Reintroducing Markov States

Yurun Yuan, Tengyang Xie

Abstract

Reinforcement learning (RL) has become a standard paradigm for post-training and aligning Large Language Models (LLMs), yet recent evidence suggests it faces a persistent "capability ceiling": unlike classical RL systems that discover novel strategies, RL for LLMs often acts as a mere refiner of patterns already latent in pre-trained weights. In this work, we identify a fundamental structural bottleneck: while classical RL relies on compact, informative Markov states, current LLM post-training formulations are tethered to an ever-expanding history of actions. We revisit a classical principle long central to RL yet absent from LLM post-training: explicit Markov states. Theoretically, we provide rigorous guarantees demonstrating that leveraging estimated Markov states can significantly reduce sample complexity. Empirically, we show that introducing Markov states consistently breaks the performance boundaries of standard RL post-training across a suite of complex logic puzzles. Our findings suggest that moving beyond "history-as-state" modeling in favor of structured Markovian representations is essential for unlocking open-ended discovery and genuinely new reasoning capabilities in Generative AI.

Paper Structure (49 sections, 3 theorems, 30 equations, 11 figures, 9 tables, 3 algorithms)

Key Result

Proposition 1

For the paradigm of action-sequence-based learning, suppose Assumptions asm:opt-error and asm:def-eps-stat hold. Then we have

Figures (11)

  • Figure 1: The Combination Lock problem with horizon $H$. At each state $S_h$, the correct action $A_h^+$ advances the agent to the next state; the incorrect action $A_h^-$ resets it to the starting position.
  • Figure 2: Comparison of the Markov agent and the action-sequence agent on the Combination Lock task. We focus on two metrics: the success rate in reaching the final goal state and the furthest state reached before the agent triggers an incorrect action.
  • Figure 3: Comparison of action-sequence-based models and our Markovian Models. For action-sequence models, the new action ($a_h$) is appended to the existing action history and fed back into the model as the input for the subsequent prediction. For Markov models, the next action $a_h$ is combined with the current state $s_h$ and passed through a state transition function $\mathcal{P}$ to obtain the next state $s_{h+1}$, which is the input for the subsequent prediction.
  • Figure 4: Pass@$k$ accuracy for Qwen3-4B-based models on Futoshiki. While action-sequence models rarely improve SFT Pass@$k$, Markov models consistently surpass their base models' limits.
  • Figure 5: Training reward curves. Markov models reach higher rewards in fewer training steps, providing empirical evidence of lower sample complexity.
  • ...and 6 more figures
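
The Combination Lock of Figure 1 and the two input conventions of Figure 3 can be illustrated with a minimal sketch. It is not the authors' implementation: the class `CombinationLock`, the helpers `markov_inputs` and `action_sequence_inputs`, the binary action encoding, and the absorbing goal state are all assumptions made for illustration. The sketch only shows what each kind of agent would receive as input at every step of the same rollout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CombinationLock:
    """Toy Combination Lock (hypothetical helper, not from the paper).

    correct[h] is the action that advances from state h to state h + 1.
    Any other action resets the agent to state 0. State H = len(correct)
    is treated as an absorbing goal state (an assumption of this sketch).
    """
    correct: tuple

    @property
    def horizon(self):
        return len(self.correct)

    def transition(self, state, action):
        """Plays the role of the state transition function P in Figure 3."""
        if state == self.horizon:
            return state  # goal reached, stay there
        return state + 1 if action == self.correct[state] else 0


def markov_inputs(env, actions):
    """Inputs seen by a Markov agent: only the current state s_h."""
    state, inputs = 0, []
    for action in actions:
        inputs.append(state)
        state = env.transition(state, action)
    return inputs, state


def action_sequence_inputs(actions):
    """Inputs seen by an action-sequence agent: the growing action history."""
    history, inputs = [], []
    for action in actions:
        inputs.append(tuple(history))
        history.append(action)
    return inputs


env = CombinationLock(correct=(1, 0, 1))
actions = [1, 1, 1, 0, 1]  # second action is wrong and resets the lock

states, final_state = markov_inputs(env, actions)
histories = action_sequence_inputs(actions)
print(states, final_state)
print(histories)
```

In this sketch the Markov agent only ever sees one of `horizon + 1` possible inputs, whereas the action-sequence agent's input grows by one action per step, so the number of distinct inputs it can encounter grows with the length of the rollout.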

Theorems & Definitions (6)

  • Proposition 1: Performance guarantee of action-sequence-based learning
  • Proposition 2: Performance guarantee of approximate Markovian learning
  • Lemma 3: Performance difference lemma
  • Proof of Proposition 1
  • Proof of Proposition 2
  • Proof of Lemma 3