$Q\sharp$: Provably Optimal Distributional RL for LLM Post-Training

Jin Peng Zhou, Kaiwen Wang, Jonathan Chang, Zhaolin Gao, Nathan Kallus, Kilian Q. Weinberger, Kianté Brantley, Wen Sun

TL;DR

The paper addresses LLM post-training for alignment and reasoning by revisiting KL-regularized RL in deterministic MDPs. It introduces $Q\sharp$, a value-based method that guides the reference policy with the optimal regularized $Q$-function $Q^{\star,\eta}$, learned via distributional RL on aggregated data, which avoids backpropagation through the large model. The authors show that $Q^{\star,\eta}$ can be expressed as a function of the cumulative reward distribution $Z^{\star}$. This enables a DAgger-like training loop that iteratively improves the distributional critic without TD updates and yields second-order, variance-dependent PAC guarantees under realizability. Empirically, $Q\sharp$ improves math-reasoning performance and reduces KL divergence relative to baselines on Star-Graph, GSM8K, MATH, and cross-model settings, while supporting inference at multiple values of $\eta$ and letting small value models guide generations from a large reference policy. Overall, the approach offers a principled, efficient, and scalable path for post-training LLM alignment with theoretical guarantees and practical gains.

Abstract

Reinforcement learning (RL) post-training is crucial for LLM alignment and reasoning, but existing policy-based methods, such as PPO and DPO, can fall short of fixing shortcuts inherited from pre-training. In this work, we introduce $Q\sharp$, a value-based algorithm for KL-regularized RL that guides the reference policy using the optimal regularized $Q$ function. We propose to learn the optimal $Q$ function using distributional RL on an aggregated online dataset. Unlike prior value-based baselines that guide the model using unregularized $Q$-values, our method is theoretically principled and provably learns the optimal policy for the KL-regularized RL problem. Empirically, $Q\sharp$ outperforms prior baselines in math reasoning benchmarks while maintaining a smaller KL divergence to the reference policy. Theoretically, we establish a reduction from KL-regularized RL to no-regret online learning, providing the first bounds for deterministic MDPs under only realizability. Thanks to distributional RL, our bounds are also variance-dependent and converge faster when the reference policy has small variance. In sum, our results highlight $Q\sharp$ as an effective approach for post-training LLMs, offering both improved performance and theoretical guarantees. The code can be found at https://github.com/jinpz/q_sharp.

Paper Structure

This paper contains 33 sections, 13 theorems, 26 equations, 13 figures, 7 tables, 2 algorithms.

Key Result

Theorem 2.2

Under the deterministic-MDP assumption, we have $V^{\star,\eta}_h(x_h)=\eta\ln\mathbb{E}_{\pi^{\text{ref}}}[\exp(\eta^{-1}\sum_{t\geq h}r_t(x_t,y_t))\mid x_h]$ and $Q^{\star,\eta}_h(x_h,y_h)=\eta\ln\mathbb{E}_{\pi^{\text{ref}}}[\exp(\eta^{-1}\sum_{t\geq h}r_t(x_t,y_t))\mid x_h,y_h]$.
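
As a rough numerical illustration of Theorem 2.2, the sketch below evaluates $\eta\ln\mathbb{E}[\exp(R/\eta)]$ for a discrete reward-to-go distribution under the reference policy, then reweights the reference policy by $\exp(Q/\eta)$, which is the standard closed form of a KL-regularized optimal policy. The function names `soft_value` and `guided_policy` and the toy numbers are assumptions made for illustration; they are not taken from the paper or its code release.

```python
import numpy as np


def soft_value(returns, probs, eta):
    """Compute eta * ln E[exp(R / eta)] for a discrete reward-to-go distribution.

    `returns` are the possible cumulative rewards and `probs` their
    probabilities under the reference policy (illustrative inputs).
    """
    returns = np.asarray(returns, dtype=float)
    probs = np.asarray(probs, dtype=float)
    keep = probs > 0  # drop zero-probability outcomes to avoid log(0)
    z = returns[keep] / eta + np.log(probs[keep])
    m = z.max()  # subtract the max for a numerically stable log-sum-exp
    return eta * (m + np.log(np.exp(z - m).sum()))


def guided_policy(ref_probs, q_values, eta):
    """Reweight the reference policy: pi(y|x) proportional to pi_ref(y|x) * exp(Q(x,y) / eta)."""
    logits = np.log(np.asarray(ref_probs, dtype=float))
    logits = logits + np.asarray(q_values, dtype=float) / eta
    logits = logits - logits.max()  # stabilize before exponentiating
    weights = np.exp(logits)
    return weights / weights.sum()


# Toy example (assumed numbers): two actions, each with its own
# reward-to-go distribution under the reference policy.
eta = 0.1
q_a = soft_value([0.1], [1.0], eta)            # deterministic return
q_b = soft_value([1.0, 0.0], [0.05, 0.95], eta)  # rare high return
policy = guided_policy([0.5, 0.5], [q_a, q_b], eta)
```

Because the expectation sits inside the exponential, a rare high return can dominate the value when $\eta$ is small, whereas the value approaches the plain expected return as $\eta$ grows.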

Figures (13)

  • Figure 1: (Left) A sketch of our post-training algorithm ($Q\sharp$) based on distributional RL. $Q\sharp$ alternates between learning $Z^\star$ -- the reward-to-go distribution of $\pi^{\text{ref}}$ -- and using the induced policy to collect new data and further improve the distributional estimate. (Right) Evaluation result on the GSM8K dataset [cobbe2021training]. We see that $Q\sharp$ achieves both higher accuracy and lower KL compared to prior value-based post-training algorithms [mudgal2023controlled, han2024value].
  • Figure 2: (a) The star-graph with degree $d=5$ and path length $\ell=5$. Next-token prediction, the standard pre-training loss, learns a sub-optimal shortcut that selects a random first node and follows it until the end. (b) Accuracies on held-out test graphs for various models. Policy-based post-training methods such as REINFORCE and RPO (a variant of DPO) still exhibit the shortcut and have test accuracy $1/d$, while our value-based $Q\sharp$ fixes the shortcut and achieves near-perfect accuracy.
  • Figure 3: Performance tradeoff between CD and $Q\sharp$ on the GSM8K validation set. (Left) pass@1 vs. KL divergence: $Q\sharp$ outperforms CD with higher accuracy and lower KL. (Right) KL divergence across $\eta$: CD's KL increases rapidly and deviates from $\pi^{\text{ref}}$, while $Q\sharp$ remains stable.
  • Figure 4: A tree MDP where edges are labeled with $\pi^{\text{ref}}$'s action probability. $\pi^{\text{ref}}$ goes to the left sub-tree w.p. $p_L$ and the right sub-tree w.p. $p_R$, where $p_L,p_R > 0$. The left sub-tree gives $r=0.1$ w.p. $1$. In the right sub-tree, $\pi^{\text{ref}}$ chooses reward $1$ w.p. $0.05$ and chooses reward $0$ w.p. $0.95$.
  • Figure 5: Full results for star-graph experiments. The empty bars for $G(5,5)$ and $G(3,8)$ correspond to DPO, which pushed down both the chosen and rejected paths, resulting in $0$ accuracy.
  • ...and 8 more figures
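
The tree MDP described in the Figure 4 caption can be worked through numerically. The sketch below is one possible reading of that example, not the paper's own analysis: it compares the expected reward under $\pi^{\text{ref}}$ (an unregularized $Q$-value) with the regularized value $\eta\ln\mathbb{E}[\exp(r/\eta)]$ from Theorem 2.2 for each sub-tree, then reweights the first-step probabilities by $\exp(Q/\eta)$. The caption leaves $p_L$ and $p_R$ unspecified, so $p_L = p_R = 0.5$ is an assumption here, as are the helper names.

```python
import math

# Reward outcomes as (reward, probability under pi_ref), from the Figure 4 caption.
LEFT = [(0.1, 1.0)]
RIGHT = [(1.0, 0.05), (0.0, 0.95)]


def soft_q(outcomes, eta):
    """Regularized value: eta * ln sum_i p_i * exp(r_i / eta)."""
    z = [r / eta + math.log(p) for r, p in outcomes if p > 0]
    m = max(z)  # stable log-sum-exp
    return eta * (m + math.log(sum(math.exp(v - m) for v in z)))


def mean_q(outcomes):
    """Unregularized value: expected reward under pi_ref."""
    return sum(r * p for r, p in outcomes)


def first_step_policy(eta, p_left=0.5, use_soft=True):
    """Return (P(left), P(right)) after reweighting pi_ref by exp(Q / eta).

    p_left = 0.5 is an assumed value; the caption only requires p_L, p_R > 0.
    """
    if use_soft:
        q_left, q_right = soft_q(LEFT, eta), soft_q(RIGHT, eta)
    else:
        q_left, q_right = mean_q(LEFT), mean_q(RIGHT)
    w_left = p_left * math.exp(q_left / eta)
    w_right = (1.0 - p_left) * math.exp(q_right / eta)
    total = w_left + w_right
    return w_left / total, w_right / total


eta = 0.1
print("expected reward:", mean_q(LEFT), mean_q(RIGHT))
print("regularized Q  :", soft_q(LEFT, eta), soft_q(RIGHT, eta))
print("guided by regularized Q  :", first_step_policy(eta, use_soft=True))
print("guided by expected reward:", first_step_policy(eta, use_soft=False))
```

Under these assumptions, the expected reward ranks the left sub-tree above the right one, while for small $\eta$ the regularized value ranks the right sub-tree higher, since the optimal policy can concentrate on the reward-$1$ branch once it gets there.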

Theorems & Definitions (22)

  • Theorem 2.2
  • Theorem 4.1
  • proof
  • Theorem 4.2
  • proof
  • Theorem 4.4
  • Corollary 4.5
  • Theorem B.1
  • proof
  • Lemma B.1: Donsker-Varadhan's Variational Formula [donsker1983asymptotic]
  • ...and 12 more