RiskPO: Risk-based Policy Optimization via Verifiable Reward for LLM Post-Training
Tao Ren, Jinyang Jiang, Hui Yang, Wan Tian, Minhao Zou, Guanghao Li, Zishi Zhang, Qinghao Wang, Shentao Qin, Yanjun Zhao, Rui Tao, Hui Shao, Yijie Peng
TL;DR
RiskPO introduces a distributional, risk-averse approach to RL with verifiable rewards for post-training large language models. By replacing mean-based objectives with Mixed Value-at-Risk (MVaR) and bundling multiple questions, it emphasizes challenging reasoning paths, mitigates entropy collapse, and strengthens exploration. Theoretical results connect entropy preservation to tail-focused gradients, while extensive experiments across math, multi-modal reasoning, and code generation show consistent, significant gains over GRPO and other baselines, including improved Pass@1 and Pass@k metrics. Overall, risk-based optimization expands the reasoning frontier of LLMs and offers a principled, effective framework for enhancing reasoning capabilities in post-training settings.
Abstract
Reinforcement learning with verifiable reward has recently emerged as a central paradigm for post-training large language models (LLMs); however, prevailing mean-based methods, such as Group Relative Policy Optimization (GRPO), suffer from entropy collapse and limited reasoning gains. We argue that these issues stem from overemphasizing high-probability output sequences while neglecting rare but informative reasoning paths. To address these challenges, we propose Risk-based Policy Optimization (RiskPO), which replaces classical mean-based objectives with principled risk measures. Specifically, we introduce a Mixed Value-at-Risk (MVaR) objective that places weighted emphasis on multiple regions of the reward distribution, thereby amplifying gradient signals on challenging instances and preventing overconfident convergence. We further design a bundling scheme that aggregates multiple questions into bundles, thus enriching the feedback signal and yielding more stable and informative training dynamics. Theoretically, we prove that the risk-averse update alleviates entropy collapse and promotes exploration. Empirically, RiskPO achieves consistent and significant improvements on mathematical reasoning, multi-modal reasoning, and code generation benchmarks, surpassing GRPO and its variants on both Pass@1 and Pass@k metrics. Our results demonstrate that risk-based optimization provides a rigorous and effective paradigm for enhancing LLM reasoning capabilities.
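
To make the contrast between a mean-based objective and a tail-weighted one concrete, here is a minimal sketch in Python. It is not the paper's implementation: the abstract does not give the exact MVaR formula, so `range_mean`, `mixed_objective`, the quantile levels `alpha` and `beta`, and the extra tail weight `omega` are all hypothetical names and choices made for illustration. The sketch assumes binary verifiable rewards, sums them per bundle, and then weights the lower region of the bundle-score distribution more heavily than the middle region.

```python
def bundle_scores(rewards, bundle_size):
    """Sum per-question binary rewards into bundle-level scores.

    Bundling turns 0/1 rewards into scores in {0, ..., bundle_size},
    which gives a richer distribution to compute quantile regions on.
    """
    return [
        sum(rewards[i:i + bundle_size])
        for i in range(0, len(rewards) - bundle_size + 1, bundle_size)
    ]


def range_mean(scores, lo, hi):
    """Mean of the sorted scores between quantile levels lo and hi.

    Assumes 0 <= lo < hi <= 1. At least one score is always included.
    """
    ordered = sorted(scores)
    n = len(ordered)
    start = min(int(lo * n), n - 1)
    stop = max(start + 1, int(hi * n))
    segment = ordered[start:stop]
    return sum(segment) / len(segment)


def mixed_objective(scores, alpha, beta, omega):
    """Illustrative mixed objective (an assumption, not the paper's formula).

    Combines the lower-tail region [0, alpha) with extra weight (1 + omega)
    and the middle region [alpha, beta) with weight 1.
    """
    lower_tail = range_mean(scores, 0.0, alpha)
    middle = range_mean(scores, alpha, beta)
    return (1.0 + omega) * lower_tail + middle


if __name__ == "__main__":
    # Hypothetical rewards for 12 questions, grouped into bundles of 3.
    rewards = [1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0]
    scores = bundle_scores(rewards, bundle_size=3)
    plain_mean = sum(scores) / len(scores)
    print("bundle scores:", scores)
    print("mean objective:", plain_mean)
    print("mixed objective:", mixed_objective(scores, 0.25, 0.75, omega=1.0))
```

In this sketch the mean treats every bundle equally, whereas the mixed objective is driven mostly by the lowest-scoring bundles, which is the sense in which a tail-weighted objective emphasizes challenging instances. An actual training objective would additionally need a gradient estimator with respect to the policy parameters, which is omitted here.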
