CTRLS: Chain-of-Thought Reasoning via Latent State-Transition

Junda Wu, Yuxin Xiong, Xintong Li, Sheldon Yu, Zhengmian Hu, Tong Yu, Rui Wang, Xiang Chen, Jingbo Shang, Julian McAuley

TL;DR

CTRLS reframes chain-of-thought as latent-state transitions within a Markov decision process, enabling principled, transition-aware exploration of reasoning trajectories. It combines a variational latent-state encoder, a state-conditioned generator, and a transition model trained with a unified ELBO objective and on-policy reinforcement learning using a distributional policy over latent transitions. The approach yields improvements in reasoning accuracy, diversity, and exploration efficiency on standard math benchmarks, while enhancing explainability through explicit latent dynamics and self-reflective validation. This framework offers a principled path toward more verifiable and robust reasoning in large language models.

Abstract

Chain-of-thought (CoT) reasoning enables large language models (LLMs) to break down complex problems into interpretable intermediate steps, significantly enhancing model transparency and performance in reasoning tasks. However, conventional CoT methods rely on heuristic sampling without structured modeling of reasoning transitions, which limits their ability to systematically explore and discover diverse and effective reasoning trajectories. In this work, we introduce CTRLS, a framework that formulates CoT reasoning as a Markov decision process (MDP) with latent state transitions, enabling principled and state-aware exploration via distributional reinforcement learning. By modeling reasoning actions as explicit probability distributions in latent space, our approach captures epistemic uncertainty and facilitates robust exploration of the reasoning space. As part of our framework, we introduce an on-policy reinforcement learning strategy incorporating epsilon-greedy exploration and entropy-based regularization to iteratively refine latent state transitions without requiring additional fine-tuning of the underlying LLM. Our theoretical analysis derives an evidence lower bound (ELBO) that grounds the transition-aware modeling of latent reasoning dynamics. Experiments further demonstrate improvements in reasoning accuracy, diversity, and exploration efficiency across benchmark reasoning tasks.
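The exploration strategy named in the abstract (epsilon-greedy selection plus entropy regularization over a distribution of latent transitions) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the latent space is treated as a small set of discrete states, the scores are hypothetical, and the names `epsilon_greedy_transition`, `regularized_objective`, and `beta` do not come from the paper.

```python
import math
import random


def softmax(logits):
    """Turn unnormalized scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def epsilon_greedy_transition(probs, epsilon, rng):
    """Pick the index of the next latent state (hypothetical helper).

    With probability epsilon choose uniformly at random (explore),
    otherwise take the most probable state (exploit).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(probs))
    return max(range(len(probs)), key=lambda k: probs[k])


def regularized_objective(reward, probs, chosen, beta):
    """REINFORCE-style surrogate plus an entropy bonus, to be maximized.

    beta (an assumed hyperparameter) weights the entropy term, which
    discourages the transition distribution from collapsing too early.
    """
    return reward * math.log(probs[chosen]) + beta * entropy(probs)


rng = random.Random(0)
probs = softmax([2.0, 0.5, 0.1, -1.0])  # hypothetical transition scores
state = epsilon_greedy_transition(probs, epsilon=0.1, rng=rng)
objective = regularized_objective(reward=1.0, probs=probs, chosen=state, beta=0.01)
```

In this sketch, `epsilon` controls how often a transition ignores the current policy, while the entropy term keeps the policy itself from becoming deterministic. The two mechanisms are complementary, which is presumably why the abstract pairs them.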

Paper Structure

This paper contains 24 sections, 1 theorem, 25 equations, 7 figures, 3 tables, 2 algorithms.

Key Result

Theorem 5.3

Consider a latent-state generative model with joint density $P_{\omega,\theta}(x_{1:T},z_{1:T}) =\prod_{t=1}^{T} P_{\omega}\bigl(x_t\mid x_{<t},z_{\le t}\bigr) P_{\theta}\bigl(z_t\mid x_{<t},z_{<t}\bigr)$, and let $Q_{\phi}(z_{1:T}\mid x_{1:T}) =\prod_{t=1}^{T}Q_{\phi}\bigl(z_t\mid x_{\le t}\bigr)$ … Equality holds if and only if $Q_{\phi}(z_{1:T}\mid x_{1:T}) = P_{\theta}(z_{1:T}\mid x_{1:T})$, i.e. …
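For the factorizations stated in the theorem, the standard evidence lower bound follows from Jensen's inequality and can be sketched as below. This is the textbook form implied by the stated densities, written with the theorem's own symbols; the paper's exact statement may group or present the terms differently.

```latex
\begin{aligned}
\log P_{\omega,\theta}(x_{1:T})
&= \log \mathbb{E}_{Q_{\phi}(z_{1:T}\mid x_{1:T})}
   \left[\frac{P_{\omega,\theta}(x_{1:T},z_{1:T})}{Q_{\phi}(z_{1:T}\mid x_{1:T})}\right] \\
&\ge \mathbb{E}_{Q_{\phi}(z_{1:T}\mid x_{1:T})}
   \left[\sum_{t=1}^{T}\Bigl(
     \log P_{\omega}\bigl(x_t\mid x_{<t},z_{\le t}\bigr)
     + \log P_{\theta}\bigl(z_t\mid x_{<t},z_{<t}\bigr)
     - \log Q_{\phi}\bigl(z_t\mid x_{\le t}\bigr)
   \Bigr)\right]
\end{aligned}
```

The gap between the two sides is the KL divergence from $Q_{\phi}(z_{1:T}\mid x_{1:T})$ to the model's posterior over $z_{1:T}$ given $x_{1:T}$, which is why the bound is tight exactly when the two distributions coincide.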

Figures (7)

  • Figure 1: Illustration of the difference between conventional CoT prompting and CTRLS.
  • Figure 2: An overview of the proposed two-phase alignment and fine-tuning scheme.
  • Figure 3: Qualitative comparison. CTRLS correctly verifies candidate solutions and filters out invalid cases, while the baseline fails to check whether the resulting value is truly prime.
  • Figure 4: On-policy learning curves for CTRLS with LLaMA3.2 and Qwen2.5.
  • Figure 5: On-policy reinforcement learning curves for LLaMA3.2 and Qwen2.5 on the MATH dataset.
  • ...and 2 more figures

Theorems & Definitions (3)

  • Definition 5.1
  • Definition 5.2
  • Theorem 5.3: Evidence Lower-Bound