Approximation of Log-Partition Function in Policy Mirror Descent Induces Implicit Regularization for LLM Post-Training

Zhenghao Xu, Qin Lu, Changlong Yu, Tuo Zhao

TL;DR

This work analyzes PMD-mean, a practical off-policy regression variant for policy mirror descent in large-action RL settings like LLM post-training. It derives an exact Lambert-$W$ form for PMD-mean’s population update and proves its equivalence to mirror descent with an adaptive mixed KL--$\chi^2$ regularizer, providing a principled explanation for its stability under finite rollouts. The analysis shows that the induced $\chi^2$ term curbs large probability changes, especially when mean rewards are low, and demonstrates improved robustness to estimation error compared to the partition-based target. Empirically, PMD-mean yields stable, efficient improvements on math reasoning tasks, outperforming GRPO baselines and offering practical advantages in large-scale LLM training where rollout budgets are limited.

Abstract

Policy mirror descent (PMD) provides a principled framework for reinforcement learning (RL) by iteratively solving KL-regularized policy improvement subproblems. While this approach has been adopted in training advanced LLMs such as Kimi K1.5/K2, the ideal closed-form PMD updates require reliable partition function estimation, a significant challenge when working with limited rollouts in the vast action spaces of LLMs. We investigate a practical algorithm, termed PMD-mean, that approximates the log-partition term with the mean reward under the sampling policy and performs regression in log-policy space. Specifically, we characterize the population solution of PMD-mean and demonstrate that it implicitly optimizes mirror descent subproblems with an adaptive mixed KL--$\chi^2$ regularizer. This additional $\chi^2$ regularization constrains large probability changes, producing more conservative updates when expected rewards are low and enhancing robustness against finite-sample estimation errors. Experiments on math reasoning tasks show that PMD-mean achieves superior performance with improved stability and time efficiency. These findings deepen our understanding of PMD-mean and illuminate pathways toward principled improvements in RL algorithms for LLMs. Code is available at https://github.com/horizon-rl/OpenKimi.
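
To make the two regression targets concrete, the following is a minimal sketch, not the authors' implementation. It assumes the ideal KL-regularized update has the closed form $\pi_{t+1}(y)\propto\pi_t(y)\exp(r(y)/\tau)$, so that the partition-based target for $\log\frac{\pi_{t+1}(y)}{\pi_t(y)}$ is $r(y)/\tau-\log Z$, while PMD-mean replaces $\tau\log Z$ with the mean reward. The function name `pmd_targets` and the toy rewards are hypothetical, and both quantities are estimated from a small batch of rollouts.

```python
import math

def pmd_targets(rewards, tau):
    """Per-rollout regression targets for log(pi_{t+1}/pi_t).

    Returns (part_targets, mean_targets). Both use empirical estimates
    from the given rollouts; this is an illustrative sketch only.
    """
    n = len(rewards)
    mean_r = sum(rewards) / n
    # Empirical log-partition: log((1/n) * sum_i exp(r_i / tau)), via log-sum-exp.
    shift = max(r / tau for r in rewards)
    log_z = shift + math.log(sum(math.exp(r / tau - shift) for r in rewards) / n)
    part_targets = [r / tau - log_z for r in rewards]      # partition-based target
    mean_targets = [(r - mean_r) / tau for r in rewards]   # mean-baseline target
    return part_targets, mean_targets

# Hypothetical batch: four rollouts with binary rewards, one success.
part, mean = pmd_targets([0.0, 0.0, 0.0, 1.0], tau=0.5)
```

By Jensen's inequality $\tau\log Z$ is at least the mean reward, so in this sketch the mean-baseline targets are never below the partition-based ones; only the partition-based targets exponentiate to an average of one over the batch.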

Paper Structure

This paper contains 34 sections, 15 theorems, 134 equations, 8 figures, 4 tables.

Key Result

Theorem 3.1

Assume $\pi_t(y)>0$ for all $y\in\mathcal{Y}$. Let $\Delta_y \coloneqq r(y) - \mathbb{E}_{y^\prime\sim \pi_t}[r(y^\prime)]$ denote the mean-baseline advantage. Then the unique minimizer of eq:L_mean over the probability simplex satisfies […], where $W(\cdot)$ is the principal branch of the Lambert-$W$ function (inverse of $f(w)=w\cdot e^w$) and $\lambda\ge 0$ is a normalization constant chosen such th…
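
As a sketch of how a Lambert-$W$ form of this kind can arise, assume (the excerpt does not state eq:L_mean) that the objective is the squared-error regression $\mathbb{E}_{y\sim\pi_t}\big[(\log\frac{\pi(y)}{\pi_t(y)}-\frac{\Delta_y}{\tau})^2\big]$. Under that assumption, the stationarity condition with a multiplier $\lambda$ for normalization reads $\log\rho_y+\lambda\rho_y=\Delta_y/\tau$ for the ratio $\rho_y=\pi(y)/\pi_t(y)$, which is solved by $\rho_y=\exp\!\big(\Delta_y/\tau-W(\lambda e^{\Delta_y/\tau})\big)$. The code below evaluates this numerically; `lambert_w0`, `pmd_mean_update`, and the toy policy are hypothetical names and values, and the result should not be read as the theorem's exact statement.

```python
import math

def lambert_w0(x, iters=60):
    """Principal branch W(x) for x >= 0, via Newton's method on w*exp(w) = x."""
    w = math.log1p(x)  # starts at or above the root, so Newton descends toward it
    for _ in range(iters):
        ew = math.exp(w)
        w -= (w * ew - x) / (ew * (w + 1.0))
    return w

def pmd_mean_update(pi_t, rewards, tau):
    """Solve log(rho) + lam*rho = Delta/tau with sum_y pi_t(y)*rho_y = 1."""
    mean_r = sum(p * r for p, r in zip(pi_t, rewards))
    a = [(r - mean_r) / tau for r in rewards]  # scaled mean-baseline advantages

    def ratios(lam):
        return [math.exp(ai - lambert_w0(lam * math.exp(ai))) for ai in a]

    def mass(lam):
        return sum(p * rho for p, rho in zip(pi_t, ratios(lam)))

    # Total mass is decreasing in lam and at least 1 at lam = 0, so bisect.
    lo, hi = 0.0, 1.0
    while mass(hi) > 1.0:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if mass(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return lam, [p * rho for p, rho in zip(pi_t, ratios(lam))]

# Hypothetical three-action example with one rewarded action.
pi_t = [0.5, 0.3, 0.2]
lam, pi_next = pmd_mean_update(pi_t, rewards=[0.0, 0.0, 1.0], tau=0.5)
```

Bisection on $\lambda$ is used here only because the total mass is monotone in $\lambda$; any one-dimensional root finder would serve equally well.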

Figures (8)

  • Figure 1: Left: Scaled log-partition function vs average reward assuming binary rewards. The gap is significant for moderate $\tau$. Right: Illustration of PMD-mean and PMD-part converging to different subproblem solutions in the probability simplex.
  • Figure 2: The (log) probability ratio of updates in PMD-mean is more conservative than that in PMD-part for binary rewards.
  • Figure 3: Target estimation error of PMD-mean and PMD-part with $\tau=0.05$ and $p_t$ ranging from $0.01$ to $0.2$. Left: the target estimation error $\overline{\Delta^2}$. Right: the scaled estimation error with the corresponding prefactor $e^{B_+}$ in \ref{eq:one_step_improvement}. The plot shows the average over $100$ random seeds. When the rollout sample size $n$ is small, the error of PMD-part is much larger for small $p_t$.
  • Figure 4: Training curves (smoothed) of PMD-mean (upper) and PMD-part (lower) with baselines for Qwen2.5-7B on DAPO-Math-17k (left) and the averaged evaluation accuracy on AIME 2024 and AIME 2025 (right). The global step of on-policy gradient is divided by 16 to match other algorithms.
  • Figure 5: The minimum of log-ratios $\log\frac{\pi_{t+1}}{\pi_t}$ in PMD-mean and PMD-part, estimated from the last update mini-batch.
  • ...and 3 more figures
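
The gap described in the Figure 1 caption can be reproduced numerically under a simple assumption: rewards are Bernoulli with success probability $p$ under the sampling policy, so the scaled log-partition function is $\tau\log\big(1-p+p\,e^{1/\tau}\big)$ and the average reward is $p$. The sketch below is illustrative only; the function name and the chosen values of $p$ and $\tau$ are hypothetical and not taken from the paper.

```python
import math

def scaled_log_partition(p, tau):
    """tau * log E[exp(r / tau)] for r ~ Bernoulli(p).

    Rewritten as 1 + tau * log(p + (1 - p) * exp(-1 / tau)) to avoid
    overflow in exp(1 / tau) when tau is small.
    """
    return 1.0 + tau * math.log(p + (1.0 - p) * math.exp(-1.0 / tau))

def gap(p, tau):
    """Scaled log-partition minus average reward; nonnegative by Jensen."""
    return scaled_log_partition(p, tau) - p

# Hypothetical grid: the gap for a low success rate at several temperatures.
gaps = {tau: gap(0.1, tau) for tau in (0.05, 0.5, 10.0)}
```

In this setting the gap vanishes at $p=0$ and $p=1$ and shrinks as $\tau$ grows, so the mean reward is a close substitute for the scaled log-partition term only when $\tau$ is large.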

Theorems & Definitions (29)

  • Theorem 3.1: PMD-mean solution
  • Proposition 3.2: Equivalent mixed KL--$\chi^2$ subproblem
  • Remark 3.3: Connection to $\chi^2$ preference optimization
  • Remark 3.4: Policy ratios compared to huang2024correcting
  • Lemma 4.5: Empirical minimization
  • Theorem 4.6: One-step policy improvement
  • Proposition 4.7: Ideal contraction for PMD-mean with small $\tau$
  • Proposition 4.8: Ideal contraction for PMD-part
  • Proposition 4.9: Log-ratios for PMD-mean with small $\tau$
  • Proposition 4.10: Log-ratios for PMD-part with small $\tau$
  • ...and 19 more