Direct Reasoning Optimization: Constrained RL with Token-Level Dense Reward and Rubric-Gated Constraints for Open-ended Tasks

Yifei Xu, Tusher Chakraborty, Srinagesh Sharma, Leonardo Nunes, Swati Sharma, Kate Drakos Demopulos, Emre Kıcıman, Songwu Lu, Ranveer Chandra

TL;DR

This work tackles the challenge of training LLMs on open-ended tasks where rewards are not directly verifiable. It introduces Direct Reasoning Optimization (DRO), a constrained RL framework that pairs a token-level Reasoning Reflection Reward (R3) with rubric-gated feasibility checks, built on Group Relative Policy Optimization (GRPO). R3 focuses the learning signal on reasoning-reflective tokens by leveraging token self-certainty distributions across CoT prefixes, while rubric gating enforces hard final-answer constraints and mitigates reward hacking. Empirical results across four diverse datasets show that DRO improves performance, accelerates learning, and produces outputs that satisfy feasibility criteria, demonstrating the value of combining dense reasoning signals with principled task constraints for open-ended generation.

Abstract

RL training of LLMs on open-ended tasks is challenging due to the lack of direct verifiability. In this paper, we frame such training as constrained RL that (i) optimizes a token-level dense Reasoning Reflection Reward (R3) aligned with reasoning quality, and (ii) enforces rubric-gating as feasibility constraints at the rollout group level. R3 measures the model's token-level certainty of a reference answer under its CoT reasoning prefix while selectively emphasizing reasoning-reflective tokens to capture how likely the generated reasoning is to yield the desired answer. Rubric-gating complements R3 by operationalizing principled task criteria as hard accept/reject checks on final answers. Empirically, across four datasets, our framework outperforms baselines, achieves faster, more sample-efficient learning, and respects feasibility constraints.
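The weighting idea behind R3 can be illustrated with a minimal sketch. It assumes that "reasoning-reflective" tokens are the reference tokens whose certainty varies most across the sampled CoT prefixes, and that each token's weight is proportional to that standard deviation. The function name `r3_scores`, the input layout, and the normalization are illustrative assumptions, not the authors' implementation.

```python
from statistics import pstdev


def r3_scores(certainty):
    """Weight reference tokens by how much their certainty varies across rollouts.

    certainty[i][j] is the model's certainty (e.g. probability) of reference
    token j when conditioned on the CoT of rollout i. This layout is an
    assumption made for illustration.
    """
    n_tokens = len(certainty[0])
    # Spread of each reference token's certainty across the rollout group.
    sigma = [pstdev([row[j] for row in certainty]) for j in range(n_tokens)]
    total = sum(sigma)
    if total == 0:
        # No token discriminates between rollouts: fall back to a plain mean.
        weights = [1.0 / n_tokens] * n_tokens
    else:
        weights = [s / total for s in sigma]
    # Per-rollout reward: certainty aggregated with the sigma-based weights.
    scores = [sum(w * c for w, c in zip(weights, row)) for row in certainty]
    return weights, scores


# Three rollouts, three reference tokens. Only the middle token reacts to
# the reasoning, so it receives all of the weight.
group = [
    [0.9, 0.8, 0.5],
    [0.9, 0.2, 0.5],
    [0.9, 0.5, 0.5],
]
weights, scores = r3_scores(group)
```

In this toy group the first and last tokens are equally certain under every rollout, so they contribute nothing to the comparison, and the ranking of rollouts is decided by the middle token alone.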

Paper Structure

This paper contains 35 sections, 15 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: DRO at a glance: constrained RL for open-ended tasks. $\bf{R3}$ provides a dense, token-level reward for CoT traces that emphasizes reasoning-reflective tokens; rubric supervision enforces hard lexical and semantic constraints on final answers through rollout group-level rejection; and variance-based filtering discards low-variance rollout groups that carry insufficient signal for comparative learning.
  • Figure 2: Illustrative example of Reasoning Reflection Reward ($\bf{R3}$). For the paper revision task, the model is prompted to revise a paragraph based on reviewer comments (upper left). $\bf{R3}$ computes per-token self-certainty (log-probabilities) in the reference revision (upper right) for each sampled reasoning trace, and highlights reasoning-reflective tokens using $\sigma(\text{certainty})$. In this example, Reasoning A correctly identifies that Section 4 (overview) has been moved earlier and adjusts the paragraph structure accordingly, with a minor omission of section numbers. Reasoning B gives up. While a vanilla aggregate of certainty prefers B over A due to A's lower certainty on the token "2", $\bf{R3}$ successfully aligns with the desired ranking by up-weighting high-$\sigma(\text{certainty})$ tokens "gives", "existing" and "." that better reflect reasoning effectiveness.
  • Figure 3: Simulation results comparing correlation between rollout advantages and reasoning-reflective token signals under different aggregation methods. Plain average log-probability exhibits severe degradation due to noisy low-probability tokens, average probability performs moderately better but still suffers dilution, while $\bf{R3}$ consistently maintains higher correlation across sequence lengths and reflective-token fractions.
  • Figure 4: Rubric rejection rate over training. The curve shows the cumulative fraction of total rejections. On ParaRev and RaR-Medicine, it flattens over training, indicating fewer rejections and improved rubric satisfaction. On ContractNLI, the curve is closer to linear, suggesting limited improvement, consistent with downstream results.
  • Figure 5: Policy entropy collapse during training on ParaRev without rubric supervision.
  • ...and 1 more figure
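The group-level rubric rejection and variance-based filtering described in the Figure 1 caption can be sketched as follows. This is a hedged illustration: the function `group_advantages`, the `min_pass_rate` and `min_std` thresholds, and the rule that a group is rejected when its rubric pass rate falls below the threshold are all assumptions, since the caption does not specify them. The advantage itself uses the standard GRPO normalization (reward minus group mean, divided by group standard deviation).

```python
from statistics import mean, pstdev


def group_advantages(rewards, rubric_ok, min_pass_rate=1.0, min_std=1e-3):
    """Return GRPO-style advantages for one rollout group, or None if dropped.

    rewards:   one scalar reward per rollout in the group.
    rubric_ok: one boolean per rollout (did the final answer pass the rubric?).
    Both thresholds are hypothetical knobs chosen for illustration.
    """
    # Rubric gate: reject the whole group when too few answers are feasible.
    pass_rate = sum(rubric_ok) / len(rubric_ok)
    if pass_rate < min_pass_rate:
        return None
    # Variance filter: near-identical rewards give no comparative signal.
    spread = pstdev(rewards)
    if spread < min_std:
        return None
    # Group-relative advantage: centre and scale rewards within the group.
    mu = mean(rewards)
    return [(r - mu) / spread for r in rewards]


kept = group_advantages([0.2, 0.8], [True, True])
dropped_low_variance = group_advantages([0.5, 0.5], [True, True])
dropped_by_rubric = group_advantages([0.2, 0.8], [True, False])
```

With these inputs only the first group yields advantages; the second is filtered for lack of reward spread and the third is rejected by the rubric gate.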