$Q\sharp$: Provably Optimal Distributional RL for LLM Post-Training

Jin Peng Zhou; Kaiwen Wang; Jonathan Chang; Zhaolin Gao; Nathan Kallus; Kilian Q. Weinberger; Kianté Brantley; Wen Sun

TL;DR

The paper tackles the challenge of aligning and enhancing LLMs during post-training by revisiting KL-regularized RL in deterministic MDPs. It introduces $Q\sharp$, a value-based method that guides the reference policy with the optimal regularized $Q$-function $Q^{\star,\eta}$, which is learned via distributional RL on aggregated data and so avoids backpropagation through the large model. The authors establish that $Q^{\star,\eta}$ can be expressed as a function of the cumulative reward distribution $Z^{\star}$. This enables a DAgger-like training loop that iteratively improves the distributional critic without TD updates and yields second-order, variance-dependent PAC guarantees under realizability. Empirically, $Q\sharp$ improves math-reasoning performance and reduces KL divergence to the reference policy relative to baselines, as demonstrated on Star-Graph, GSM8K, MATH, and cross-model settings. It also allows inference with multiple values of $\eta$ and lets small value models guide generations from a large reference policy. Overall, the approach offers a principled, efficient, and scalable path for post-training LLM alignment with theoretical guarantees and practical gains.
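
The relationship between $Q^{\star,\eta}$ and $Z^{\star}$ mentioned above can be written out explicitly. The following is a sketch in assumed notation: $h$ indexes the step, $x$ is the current state (prompt plus generated prefix), $y$ is the next action (token), $r_h$ is the per-step reward, and $Z^{\star}_h(x,y)$ is taken to be the distribution of the cumulative reward-to-go when $\pi^{\mathrm{ref}}$ is rolled out after taking $y$ in $x$. None of these symbols beyond $Q^{\star,\eta}$, $Z^{\star}$, and $\eta$ appear in the summary itself.

$$
\max_{\pi}\; \mathbb{E}_{\pi}\Big[\sum_{h} r_h(x_h, y_h) \;-\; \eta\, \mathrm{KL}\big(\pi_h(\cdot \mid x_h)\,\|\,\pi^{\mathrm{ref}}_h(\cdot \mid x_h)\big)\Big]
$$

$$
Q^{\star,\eta}_h(x, y) \;=\; \eta \,\ln\, \mathbb{E}_{z \sim Z^{\star}_h(x, y)}\big[\exp(z/\eta)\big],
\qquad
\pi^{\star,\eta}_h(y \mid x) \;\propto\; \pi^{\mathrm{ref}}_h(y \mid x)\,\exp\!\big(Q^{\star,\eta}_h(x, y)/\eta\big)
$$

Read this way, the critic only needs to model the reward distribution of reference-policy rollouts, which is why no TD bootstrapping is required, and the same learned $Z^{\star}$ can be reused for any $\eta$ at inference time.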

Abstract

Reinforcement learning (RL) post-training is crucial for LLM alignment and reasoning, but existing policy-based methods, such as PPO and DPO, can fall short of fixing shortcuts inherited from pre-training. In this work, we introduce $Q\sharp$, a value-based algorithm for KL-regularized RL that guides the reference policy using the optimal regularized $Q$ function. We propose to learn the optimal $Q$ function using distributional RL on an aggregated online dataset. Unlike prior value-based baselines that guide the model using unregularized $Q$-values, our method is theoretically principled and provably learns the optimal policy for the KL-regularized RL problem. Empirically, $Q\sharp$ outperforms prior baselines in math reasoning benchmarks while maintaining a smaller KL divergence to the reference policy. Theoretically, we establish a reduction from KL-regularized RL to no-regret online learning, providing the first bounds for deterministic MDPs under only realizability. Thanks to distributional RL, our bounds are also variance-dependent and converge faster when the reference policy has small variance. In sum, our results highlight $Q\sharp$ as an effective approach for post-training LLMs, offering both improved performance and theoretical guarantees. The code can be found at https://github.com/jinpz/q_sharp.
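
To make the idea of guiding the reference policy concrete, here is a minimal sketch, not the authors' implementation. It assumes a binary terminal reward (for example, answer correct or not), so that the reward-to-go distribution for each candidate token is Bernoulli with some success probability, and it assumes the guided policy is proportional to $\pi^{\mathrm{ref}}(y \mid x)\exp(Q^{\star,\eta}(x,y)/\eta)$ with $Q^{\star,\eta} = \eta \ln \mathbb{E}_{z}[\exp(z/\eta)]$. The function names and the toy numbers are hypothetical.

```python
import math


def q_star_bernoulli(p_success: float, eta: float) -> float:
    """Regularized Q-value eta * ln E[exp(z/eta)] for z ~ Bernoulli(p_success).

    Assumption: the reward-to-go is 0 or 1. The expression is rearranged as
    1 + eta * ln((1 - p) * exp(-1/eta) + p) so that small eta does not overflow.
    """
    if p_success <= 0.0:
        return 0.0  # reward is always 0, so the soft value is 0
    inner = (1.0 - p_success) * math.exp(-1.0 / eta) + p_success
    return 1.0 + eta * math.log(inner)


def guide_policy(ref_probs, p_success, eta):
    """Reweight reference probabilities by exp(Q/eta) and renormalize.

    ref_probs[i]  : reference-policy probability of candidate token i
    p_success[i]  : hypothetical critic estimate that the rollout succeeds after token i
    """
    logits = []
    for prob, p in zip(ref_probs, p_success):
        if prob <= 0.0:
            logits.append(float("-inf"))  # tokens outside the support stay excluded
        else:
            logits.append(math.log(prob) + q_star_bernoulli(p, eta) / eta)
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]  # stable softmax
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    ref = [0.5, 0.3, 0.2]        # toy reference distribution over three tokens
    success = [0.2, 0.8, 0.5]    # toy per-token success estimates
    for eta in (10.0, 1.0, 0.1):
        print(eta, [round(p, 3) for p in guide_policy(ref, success, eta)])
```

A large $\eta$ keeps the guided distribution close to the reference policy, while a small $\eta$ shifts mass toward tokens with a higher estimated chance of success. Only the small critic is queried per token; the reference model's weights are never updated.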
