When Can Model-Free Reinforcement Learning be Enough for Thinking?

Josiah P. Hanna, Nicholas E. Corrado

TL;DR

The work introduces thought MDPs, which formalize thinking as internal actions that produce no reward and leave the environment state unchanged but can steer future environment actions. It proves that whether thinking emerges depends on how the policy is initialized, shows that a thought action acts as a step of policy improvement before the agent continues to act, and links this to the effective horizon in goal MDPs. Empirically, it checks the theory on LLM reasoning tasks, where forcing step-by-step thinking boosts performance, and it presents a non-language toy domain in which multi-task pre-training combined with designated thought actions yields more data-efficient RL. Together, these results clarify when and how model-free RL may develop deliberative thinking and outline avenues for extending thinking beyond language to broader AI agents.

Abstract

Recent work on large language models has demonstrated the use of model-free reinforcement learning (RL) to train reasoning-like capabilities. The emergence of "thinking" through model-free RL is interesting because thinking actions neither produce reward nor change the external world state to one where the agent is more likely to get reward. This paper seeks to build a domain-independent understanding of when model-free RL will lead to such "thinking" as a strategy for reward maximization. To build this understanding, we first introduce a theoretical model which we call a thought Markov decision process (MDP). Thought MDPs minimally extend the classical MDP model to include an abstract notion of thought state and thought action. Using the thought MDP model, we prove the importance of policy initialization in determining whether or not thinking emerges and show formally that thought actions are equivalent to the agent choosing to perform a step of policy improvement before continuing to act. We then show that open-source LLMs satisfy the conditions that our theory predicts are necessary for model-free RL to produce thinking-like behavior. Finally, we hypothesize sufficient conditions that would enable thinking to be learned outside of language generation and introduce a toy domain where a combination of multi-task pre-training and designated thought actions enables more data-efficient RL compared to non-thinking agents.

Paper Structure

This paper contains 38 sections, 7 theorems, 6 equations, 3 figures, 2 tables.

Key Result

Proposition 4

Any optimal policy, $\pi^\star$, for a thought MDP does not take thought actions: $\pi^\star(s,\tau) \in {\mathcal{A}}$, $\forall s \in {\mathcal{S}}, \tau \in {\mathcal{T}}$.

Figures (3)

  • Figure 1: (left) An example thought MDP with $|{\mathcal{T}}|=2$. We use $|{\mathcal{S}}|=10$ in our illustrative results. The agent receives a reward when it reaches the goal environment state on the far right. The agent can move left or right in the environment state space and up and down in the thought state space. We use $\gamma=0.9$ for both thinking and non-thinking time-steps. (right) Evolution of the policy and state values for 1, 4, and 10 iterations of policy iteration. The policy is initialized as shown on the left. Colors indicate value and arrows indicate the action that the policy would take.
  • Figure 2: Mean learning curves for the four agents we train in the gridworld environment. The vertical axis gives the success rate for first navigating to the bottom right and then the top left corner. The horizontal axis is the iteration of policy improvement (200 episodes are collected at each iteration). We run 5 trials for each learning agent and shading indicates a 95% bootstrap confidence interval. Light lines show individual training runs.
  • Figure 3: Mean learning curves for the four agents we train in the gridworld environment with misspecified sub-tasks. The vertical axis gives the success rate for first navigating to the bottom right and then the top left corner. The horizontal axis is the iteration of policy improvement (200 episodes are collected at each iteration). We run 10 trials for each learning agent and shading indicates a 95% bootstrap confidence interval. Light lines show individual training runs.
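
The chain setting described for Figure 1 can be used to check Proposition 4 numerically. The following is a minimal sketch, not the authors' code: it keeps $|{\mathcal{S}}|=10$, $|{\mathcal{T}}|=2$, and $\gamma=0.9$ from the caption, but the transition rules are assumptions made for illustration (environment actions leave the thought state unchanged, thought actions leave the environment state unchanged and give zero reward, reaching the goal gives reward 1 and ends the episode). It also uses value iteration rather than the policy iteration mentioned in the caption, since only the optimal policy is needed here. All function and constant names are hypothetical.

```python
import numpy as np

# Sizes and discount taken from the Figure 1 caption.
N_ENV, N_THOUGHT, GAMMA = 10, 2, 0.9
GOAL = N_ENV - 1  # goal environment state on the far right
ACTIONS = ["left", "right", "up", "down"]  # left/right: environment, up/down: thought
THOUGHT_ACTIONS = {"up", "down"}


def step(s, t, a):
    """Assumed deterministic dynamics; returns (next_s, next_t, reward)."""
    if a == "left":
        s2, t2 = max(s - 1, 0), t
    elif a == "right":
        s2, t2 = min(s + 1, N_ENV - 1), t
    elif a == "up":
        s2, t2 = s, max(t - 1, 0)  # thought action: environment state unchanged
    else:
        s2, t2 = s, min(t + 1, N_THOUGHT - 1)
    reward = 1.0 if (s2 == GOAL and s != GOAL) else 0.0
    return s2, t2, reward


def q_values(V, s, t):
    """One-step lookahead values for every action in state (s, t)."""
    q = {}
    for a in ACTIONS:
        s2, t2, r = step(s, t, a)
        q[a] = r + GAMMA * V[s2, t2]
    return q


def value_iteration(n_sweeps=50):
    """Synchronous value iteration; the goal is terminal, so its value stays 0."""
    V = np.zeros((N_ENV, N_THOUGHT))
    for _ in range(n_sweeps):
        V_new = np.zeros_like(V)
        for s in range(GOAL):
            for t in range(N_THOUGHT):
                V_new[s, t] = max(q_values(V, s, t).values())
        V = V_new
    return V


def greedy_policy(V):
    """Greedy action for every non-terminal (environment, thought) state."""
    policy = {}
    for s in range(GOAL):
        for t in range(N_THOUGHT):
            q = q_values(V, s, t)
            policy[(s, t)] = max(q, key=q.get)
    return policy


if __name__ == "__main__":
    V = value_iteration()
    policy = greedy_policy(V)
    uses_thought = any(a in THOUGHT_ACTIONS for a in policy.values())
    print("optimal policy uses a thought action:", uses_thought)
```

Under these assumed dynamics a thought action only delays reward by one discounted step, so the greedy policy with respect to the converged values never selects one, in line with Proposition 4. This does not reproduce the figure's right panel, which depends on a specific policy initialization that the caption does not spell out.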

Theorems & Definitions (14)

  • Proposition 4
  • proof
  • Theorem 5
  • proof
  • Definition 6
  • Proposition 6
  • Corollary 7
  • proof
  • Proposition 7
  • proof
  • ...and 4 more