Direct Reasoning Optimization: Constrained RL with Token-Level Dense Reward and Rubric-Gated Constraints for Open-ended Tasks
Yifei Xu, Tusher Chakraborty, Srinagesh Sharma, Leonardo Nunes, Swati Sharma, Kate Drakos Demopulos, Emre Kıcıman, Songwu Lu, Ranveer Chandra
TL;DR
This work tackles the challenge of training LLMs on open-ended tasks where rewards are not directly verifiable. It introduces Direct Reasoning Optimization (DRO), a constrained RL framework built on Group Relative Policy Optimization (GRPO) that pairs a token-level Reasoning Reflection Reward (R3) with rubric-gated feasibility checks. R3 focuses the learning signal on reasoning-reflective tokens by leveraging token self-certainty distributions across chain-of-thought (CoT) prefixes, while rubric gating enforces hard final-answer constraints and mitigates reward hacking. Across four diverse datasets, DRO outperforms baselines, learns faster, and produces outputs that satisfy the feasibility criteria, demonstrating the value of combining dense reasoning signals with principled task constraints for open-ended generation.
Abstract
RL training of LLMs on open-ended tasks is challenging because outputs are not directly verifiable. In this paper, we frame such training as constrained RL that (i) optimizes a token-level dense Reasoning Reflection Reward (R3) aligned with reasoning quality, and (ii) enforces rubric gating as a feasibility constraint at the rollout group level. R3 measures the model's token-level certainty in a reference answer conditioned on its chain-of-thought (CoT) prefix, and selectively emphasizes reasoning-reflective tokens, so that it captures how likely the generated reasoning is to yield the desired answer. Rubric gating complements R3 by operationalizing principled task criteria as hard accept/reject checks on final answers. Empirically, across four datasets, our framework outperforms baselines, learns faster and more sample-efficiently, and respects the feasibility constraints.
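To make the two components concrete, the following is a minimal illustrative sketch, not the paper's implementation. It assumes that "reasoning-reflective" tokens can be approximated as the reference-answer tokens whose certainty varies most across the CoT prefixes in a rollout group, and that a rollout rejected by the rubric is pushed below every other reward in the group before GRPO-style group normalization. The function names, the spread-based weighting, and the `penalty` margin are all hypothetical choices made for illustration.

```python
import math
from statistics import mean, pstdev


def r3_scores(ref_probs, eps=1e-8):
    """Score each rollout by weighted log-certainty of the reference answer.

    ref_probs[g][t] is the probability the policy assigns to reference
    token t when conditioned on the CoT prefix of rollout g.
    ASSUMPTION: token weights are the normalized spread (population std)
    of that probability across the group, so tokens whose certainty
    depends on the reasoning dominate the score.
    """
    n_tokens = len(ref_probs[0])
    spreads = [pstdev([row[t] for row in ref_probs]) for t in range(n_tokens)]
    total = sum(spreads)
    if total < eps:
        # No token distinguishes the rollouts: fall back to uniform weights.
        weights = [1.0 / n_tokens] * n_tokens
    else:
        weights = [s / total for s in spreads]
    return [
        sum(w * math.log(p + eps) for w, p in zip(weights, row))
        for row in ref_probs
    ]


def gated_group_advantages(scores, passes_rubric, penalty=1.0, eps=1e-8):
    """Apply a hard rubric gate, then normalize rewards within the group.

    ASSUMPTION: a rejected rollout receives a reward `penalty` below the
    lowest score in the group (hypothetical gating rule).
    """
    floor = min(scores) - penalty
    rewards = [s if ok else floor for s, ok in zip(scores, passes_rubric)]
    mu, sigma = mean(rewards), pstdev(rewards)
    # GRPO-style group-relative advantage: (reward - group mean) / group std.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three rollouts, three reference tokens; only the last token is sensitive
# to the reasoning, so it receives all of the weight.
ref_probs = [
    [0.9, 0.5, 0.8],
    [0.9, 0.5, 0.2],
    [0.9, 0.5, 0.5],
]
scores = r3_scores(ref_probs)
# The first rollout has the best R3 score but fails the rubric check.
advantages = gated_group_advantages(scores, [False, True, True])
```

In this toy group, the rollout with the highest R3 score ends up with the lowest advantage because it fails the rubric, which illustrates how a hard gate can prevent a dense reward from being exploited by answers that violate task criteria.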
