RiskPO: Risk-based Policy Optimization via Verifiable Reward for LLM Post-Training

Tao Ren, Jinyang Jiang, Hui Yang, Wan Tian, Minhao Zou, Guanghao Li, Zishi Zhang, Qinghao Wang, Shentao Qin, Yanjun Zhao, Rui Tao, Hui Shao, Yijie Peng

TL;DR

RiskPO introduces a distributional, risk-averse approach to RL with verifiable rewards for post-training large language models. By replacing mean-based objectives with Mixed Value-at-Risk (MVaR) and bundling multiple questions, it emphasizes challenging reasoning paths, mitigates entropy collapse, and strengthens exploration. Theoretical results connect entropy preservation to tail-focused gradients, while extensive experiments across math, multi-modal reasoning, and code generation show consistent, significant gains over GRPO and other baselines, including improved Pass@1 and Pass@k metrics. Overall, risk-based optimization expands the reasoning frontier of LLMs and offers a principled, effective framework for enhancing reasoning capabilities in post-training settings.

Abstract

Reinforcement learning with verifiable reward has recently emerged as a central paradigm for post-training large language models (LLMs); however, prevailing mean-based methods, such as Group Relative Policy Optimization (GRPO), suffer from entropy collapse and limited reasoning gains. We argue that these issues stem from overemphasizing high-probability output sequences while neglecting rare but informative reasoning paths. To address these challenges, we propose Risk-based Policy Optimization (RiskPO), which substitutes classical mean-based objectives with principled risk measures. Specifically, we introduce a Mixed Value-at-Risk objective that integrates weighted attention over multiple regions of the reward distribution, thereby amplifying gradient signals on challenging instances and preventing overconfident convergence. We further design a bundling scheme that aggregates multiple questions into bundles, thus enriching the feedback signal and yielding more stable and informative training dynamics. Theoretically, we prove that the risk-averse update alleviates entropy collapse and promotes exploration. Numerically, RiskPO achieves consistent and significant improvements in mathematical reasoning, multi-modal reasoning, and code generation benchmarks, surpassing GRPO and its variants on both Pass@1 and Pass@k metrics. Our results demonstrate that risk-based optimization provides a rigorous and effective paradigm for enhancing LLM reasoning capabilities.
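The abstract reports results on Pass@1 and Pass@k but does not say how these metrics are computed. As a hedged illustration, the sketch below implements the widely used unbiased Pass@k estimator, $1 - \binom{n-c}{k}/\binom{n}{k}$, for a single problem with $n$ sampled responses of which $c$ are verified correct. The function name `pass_at_k` and its argument names are ours, and the paper's evaluation code may differ.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate for one problem (illustrative sketch).

    n: number of sampled responses
    c: number of those responses judged correct by the verifier
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a uniformly chosen size-k subset contains no correct
    # response; math.comb returns 0 when k > n - c, so the estimate is 1.0.
    prob_all_wrong = comb(n - c, k) / comb(n, k)
    return 1.0 - prob_all_wrong


if __name__ == "__main__":
    # Hypothetical counts, not results from the paper.
    for k in (1, 2, 4):
        print(k, pass_at_k(n=4, c=1, k=k))
```

With $k = 1$ the estimator reduces to the fraction of correct samples $c/n$, which is why Pass@1 tracks average accuracy while larger $k$ rewards rare successes.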

Paper Structure

This paper contains 22 sections, 5 theorems, 26 equations, 9 figures, 5 tables, 1 algorithm.

Key Result

Theorem 1

Assume $F_\theta(r)$ is continuously differentiable with respect to both the parameter $\theta$ and the variable $r$; the density is positive at the quantiles, i.e., $f_\theta(F^{-1}_\theta(\alpha)) > 0$ and $f_\theta(F^{-1}_\theta(\beta)) > 0$; and differentiation under the integral sign is [...] where $g(z,a,b) = (z-a)^+ - (z-b)^+ + a - b$ and $(z)^+ = \max\{z, 0\}$.
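The helper $g$ defined in the theorem statement can be evaluated directly. The sketch below implements it and checks numerically that, when $a \le b$, it coincides with clipping $z$ to $[a, b]$ and subtracting $b$. The ordering $a \le b$ is our assumption (natural if $a$ and $b$ are the $\alpha$- and $\beta$-quantiles with $\alpha < \beta$), and the function names are ours, not the paper's.

```python
def positive_part(z: float) -> float:
    """(z)^+ = max{z, 0}."""
    return max(z, 0.0)


def g(z: float, a: float, b: float) -> float:
    """g(z, a, b) = (z - a)^+ - (z - b)^+ + a - b, as in the theorem statement."""
    return positive_part(z - a) - positive_part(z - b) + a - b


def g_clipped(z: float, a: float, b: float) -> float:
    """Equivalent form clip(z, a, b) - b; valid only under the assumption a <= b."""
    if a > b:
        raise ValueError("this form assumes a <= b")
    return min(max(z, a), b) - b


if __name__ == "__main__":
    a, b = 0.25, 0.75  # hypothetical lower and upper thresholds
    for z in (0.0, 0.25, 0.5, 0.75, 1.0):
        # Both columns agree: a - b below a, z - b in between, 0 above b.
        print(z, g(z, a, b), g_clipped(z, a, b))
```

Read this way, $g$ is constant at $a - b$ for $z \le a$, rises linearly as $z - b$ on $[a, b]$, and is zero for $z \ge b$, so it only varies with $z$ inside the band between the two thresholds.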

Figures (9)

  • Figure 1: Pass@32 and Avg@32 learning curves of DeepSeek-R1-Distill-Qwen-1.5B trained by RiskPO on AIME2024.
  • Figure 2: The framework of RiskPO.
  • Figure 3: Log-probabilities as a function of reward quantile levels for DeepSeek-R1-Distill-Qwen-1.5B on DAPOMATH-17K.
  • Figure 4: Pass@k learning curves on the AMC and MATH500 datasets.
  • Figure 5: Learning curves on DAPOMATH-17K. RiskPO mitigates entropy collapse and performs better on difficult problems, as indicated by the risk measures.
  • ...and 4 more figures

Theorems & Definitions (10)

  • Theorem 1
  • Proposition 1
  • Theorem 2
  • proof
  • proof
  • Lemma 1
  • proof
  • proof: Proof of Theorem \ref{theo:advantage_correlation}.
  • Theorem 3
  • proof