Approximation of Log-Partition Function in Policy Mirror Descent Induces Implicit Regularization for LLM Post-Training
Zhenghao Xu, Qin Lu, Changlong Yu, Tuo Zhao
TL;DR
This work analyzes PMD-mean, a practical off-policy regression variant for policy mirror descent in large-action RL settings like LLM post-training. It derives an exact Lambert-$W$ form for PMD-mean’s population update and proves its equivalence to mirror descent with an adaptive mixed KL--$\chi^2$ regularizer, providing a principled explanation for its stability under finite rollouts. The analysis shows that the induced $\chi^2$ term curbs large probability changes, especially when mean rewards are low, and demonstrates improved robustness to estimation error compared to the partition-based target. Empirically, PMD-mean yields stable, efficient improvements on math reasoning tasks, outperforming GRPO baselines and offering practical advantages in large-scale LLM training where rollout budgets are limited.
Abstract
Policy mirror descent (PMD) provides a principled framework for reinforcement learning (RL) by iteratively solving KL-regularized policy improvement subproblems. While this approach has been adopted in training advanced LLMs such as Kimi K1.5/K2, the ideal closed-form PMD updates require reliable partition function estimation, a significant challenge when working with limited rollouts in the vast action spaces of LLMs. We investigate a practical algorithm, termed PMD-mean, that approximates the log-partition term with the mean reward under the sampling policy and performs regression in log-policy space. Specifically, we characterize the population solution of PMD-mean and demonstrate that it implicitly optimizes mirror descent subproblems with an adaptive mixed KL--$\chi^2$ regularizer. This additional $\chi^2$ regularization constrains large probability changes, producing more conservative updates when expected rewards are low and enhancing robustness against finite-sample estimation errors. Experiments on math reasoning tasks show that PMD-mean achieves superior performance with improved stability and time efficiency. These findings deepen our understanding of PMD-mean and illuminate pathways toward principled improvements in RL algorithms for LLMs. Code is available at https://github.com/horizon-rl/OpenKimi.
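To make the difference between the two regression targets concrete, the following is a minimal NumPy sketch for a single prompt with a handful of sampled responses. It is not taken from the released code. The function names, the temperature symbol `tau`, and the scaling of the mean reward by `1 / tau` are assumptions, chosen so that the target matches the standard KL-regularized closed form $\pi_{\text{new}} \propto \pi_{\text{old}} \exp(r/\tau)$.

```python
import numpy as np


def pmd_targets(rewards, tau):
    """Log-ratio regression targets for one prompt (illustrative sketch).

    rewards: rewards of responses sampled from the old (sampling) policy.
    tau: KL regularization strength (assumed notation, not from the source).
    Returns (partition_target, mean_target) for log pi_new - log pi_old.
    """
    r = np.asarray(rewards, dtype=float)
    scaled = r / tau

    # Partition-based target: r/tau - log Z, with Z = E_old[exp(r/tau)]
    # estimated from the samples via a numerically stable log-mean-exp.
    m = scaled.max()
    log_z_hat = m + np.log(np.mean(np.exp(scaled - m)))
    partition_target = scaled - log_z_hat

    # PMD-mean target: the log-partition term is replaced by the mean reward
    # under the sampling policy (scaled by 1/tau, an assumption here).
    mean_target = (r - r.mean()) / tau
    return partition_target, mean_target


def regression_loss(logp_new, logp_old, target):
    """Squared error between the log-policy ratio and its target."""
    log_ratio = np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float)
    return float(np.mean((log_ratio - target) ** 2))


if __name__ == "__main__":
    part, mean = pmd_targets([1.0, 0.0, 0.0, 1.0], tau=1.0)
    print("partition-based target:", part)
    print("PMD-mean target:       ", mean)
```

By Jensen's inequality the sample log-mean-exp is never smaller than the sample mean of `r / tau`, so in this sketch the PMD-mean target is never below the partition-based target. The mean-based target is also linear in the sampled rewards, whereas the partition estimate exponentiates them.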
