Learning Structured Reasoning via Tractable Trajectory Control

Po-Nien Kung; Zhen Yang; Jeffrey Luo; Cheng-Fu Yang; Haikang Deng; Zi-Yi Dou; Yinfei Yang; Nanyun Peng; Zhe Gan; Kai-Wei Chang

TL;DR

This paper proposes Ctrl-R, a framework for learning structured reasoning via tractable trajectory control. Ctrl-R actively guides the rollout process, incentivizing exploration of the diverse reasoning patterns that are critical for complex problem-solving.

Abstract

Large language models can exhibit emergent reasoning behaviors, often manifested as recurring lexical patterns (e.g., "wait," indicating verification). However, complex reasoning trajectories remain sparse in unconstrained sampling, and standard RL often fails to guarantee the acquisition of diverse reasoning behaviors. We aim to systematically discover and reinforce diverse reasoning patterns through structured reasoning, a paradigm that requires targeted exploration of specific reasoning patterns during the RL process. To this end, we propose Ctrl-R, a framework for learning structured reasoning via tractable trajectory control that actively guides the rollout process, incentivizing the exploration of diverse reasoning patterns that are critical for complex problem-solving. The resulting behavior policy enables accurate importance-sampling estimation, supporting unbiased on-policy optimization. We further introduce a power-scaling factor on the importance-sampling weights, allowing the policy to selectively learn from exploratory, out-of-distribution trajectories while maintaining stable optimization. Experiments demonstrate that Ctrl-R enables effective exploration and internalization of previously unattainable reasoning patterns, yielding consistent improvements across language and vision-language models on mathematical reasoning tasks.
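The power-scaling idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the form below, raising the importance ratio $\pi/\mu$ to an exponent, is an assumption, as are the name `beta`, its range $[0, 1]$, and the function name `power_scaled_weight`. It is not taken from the paper's implementation.

```python
import math

def power_scaled_weight(logp_target: float, logp_behavior: float, beta: float) -> float:
    """Return (pi / mu) ** beta, computed in log space for numerical stability.

    logp_target   -- log-probability of the trajectory under the policy being optimized
    logp_behavior -- log-probability under the guided behavior policy
    beta          -- hypothetical power-scaling exponent, assumed here to lie in [0, 1]
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta is assumed to lie in [0, 1]")
    return math.exp(beta * (logp_target - logp_behavior))

# A trajectory that the target policy finds 100x less likely than the behavior policy.
logp_target, logp_behavior = math.log(0.001), math.log(0.1)
full = power_scaled_weight(logp_target, logp_behavior, beta=1.0)      # plain importance weight
tempered = power_scaled_weight(logp_target, logp_behavior, beta=0.5)  # pulled toward 1
```

Under these assumptions, `beta = 1` recovers the ordinary importance weight, and `beta = 0` weights every trajectory equally. Intermediate values move the weight of an out-of-distribution trajectory toward 1, which is one way such a factor could trade estimator bias for more learning signal from exploratory samples.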

Paper Structure (45 sections, 52 equations, 7 figures, 5 tables)

Introduction
Preliminaries
Policy Gradients and Off-Policy Learning
PPO and the Clipped Surrogate Objective
Problem Formulation: Learning Structured Reasoning
Method
Overview
Structured Reasoning as Constrained Decoding
The Ctrl-R Framework
Optimization Objective
Design Choices
Tractable Guidance
Decoupled Proximal Policy Optimization
Power-Scaled Importance Weights
Ctrl-R Implementation
...and 30 more sections

Figures (7)

Figure 1: Examples of cognitive behaviors. Although cognitive behaviors are implicit, they often manifest through recurring lexical patterns during the reasoning process. We refer to reasoning that exhibits such patterns as structured reasoning.
Figure 2: Example of sampling a guided trajectory under Ctrl-R. We first sample a constraint $\alpha$, then use the HMM to compute the marginal guidance $\gamma(\alpha \mid x_{<t}, x_t)$, which is combined with the proximal policy at each decoding step to form the guided behavior policy $\mu_{\alpha}$. We illustrate token-by-token decoding and visualize the guidance effect $w$ using blue, white, and red dots. Before the constraint is satisfied, sampled tokens may be dominated by either the proximal policy (white, $w>1$) or the guidance function $\gamma$ (blue, $w<1$). Once the constraint is satisfied, the behavior policy collapses to the proximal policy, with $w=1$ (red).
Figure 3: Trends in reasoning module usage during training and evaluation. The top panel shows rollout-time behavior. Keyword Usage (Strict) measures strict regex matches of the enforced keyphrases defined in Table \ref{tab:reasoning-controls}. Keyword Usage (Loose) captures lexical variants via expanded regex patterns (Appendix \ref{appendix:loose-keywords}). Average Scores indicate the accuracy of outputs exhibiting the corresponding reasoning patterns, as identified by loose keyword matches. All figures highlight the relative percentage change of Ctrl-R compared to the DAPO setting.
Figure 4: Early training efficiency comparison. The plot compares reward/accuracy over training steps for our Medium Control method (red) against the Baseline and other control settings. Our method achieves a 14.0% higher AUC in the first 150 steps (highlighted as the "Early Efficiency Zone"), indicating faster convergence during the initial exploration phase.
Figure 5: Deviation in keyword accuracy relative to base average across training steps. Four panels show the accuracy improvement or decline associated with specific keywords during progressive training intervals (Steps 1-100 through 301-330). The data highlights how the influence of keywords across General, Reflection, and VisualGrounding categories shifts relative to the evolving base accuracy of the model.
...and 2 more figures
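
The guided decoding step described in the Figure 2 caption can be sketched as follows. The caption says only that the guidance $\gamma(\alpha \mid x_{<t}, x_t)$ is "combined with" the proximal policy. The renormalized product used here is an assumption, and `guided_step`, `prox_probs`, and `guidance` are hypothetical names rather than the paper's API. The HMM that would produce the guidance values is not modeled.

```python
def guided_step(prox_probs, guidance):
    """Form one step of a guided behavior policy over a small vocabulary.

    prox_probs -- proximal-policy probabilities for each candidate token
    guidance   -- per-token guidance values, standing in for gamma(alpha | x_<t, x_t)
    Assumed combination rule: mu(x_t) is proportional to prox(x_t) * gamma(x_t).
    """
    unnorm = [p * g for p, g in zip(prox_probs, guidance)]
    total = sum(unnorm)
    if total == 0.0:
        raise ValueError("no token can still satisfy the constraint")
    return [u / total for u in unnorm]

prox = [0.7, 0.2, 0.1]

# Before the constraint is satisfied, guidance reshapes the distribution.
mu_before = guided_step(prox, [0.1, 0.9, 0.5])

# Once the constraint is satisfied, guidance is 1 for every token,
# so the behavior policy equals the proximal policy.
mu_after = guided_step(prox, [1.0, 1.0, 1.0])
```

In this sketch the second token, which the proximal policy rates at 0.2, becomes the most likely choice under guidance. With uniform guidance the distribution falls back to the proximal policy, matching the collapse the caption describes.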
