Approximation of Log-Partition Function in Policy Mirror Descent Induces Implicit Regularization for LLM Post-Training

Zhenghao Xu; Qin Lu; Changlong Yu; Tuo Zhao

TL;DR

This work analyzes PMD-mean, a practical off-policy regression variant for policy mirror descent in large-action RL settings like LLM post-training. It derives an exact Lambert-$W$ form for PMD-mean’s population update and proves its equivalence to mirror descent with an adaptive mixed KL--$\chi^2$ regularizer, providing a principled explanation for its stability under finite rollouts. The analysis shows that the induced $\chi^2$ term curbs large probability changes, especially when mean rewards are low, and demonstrates improved robustness to estimation error compared to the partition-based target. Empirically, PMD-mean yields stable, efficient improvements on math reasoning tasks, outperforming GRPO baselines and offering practical advantages in large-scale LLM training where rollout budgets are limited.

Abstract

Policy mirror descent (PMD) provides a principled framework for reinforcement learning (RL) by iteratively solving KL-regularized policy improvement subproblems. While this approach has been adopted in training advanced LLMs such as Kimi K1.5/K2, the ideal closed-form PMD updates require reliable partition function estimation, a significant challenge when working with limited rollouts in the vast action spaces of LLMs. We investigate a practical algorithm, termed PMD-mean, that approximates the log-partition term with the mean reward under the sampling policy and performs regression in log-policy space. Specifically, we characterize the population solution of PMD-mean and demonstrate that it implicitly optimizes mirror descent subproblems with an adaptive mixed KL--$\chi^2$ regularizer. This additional $\chi^2$ regularization constrains large probability changes, producing more conservative updates when expected rewards are low and enhancing robustness against finite-sample estimation errors. Experiments on math reasoning tasks show that PMD-mean achieves superior performance with improved stability and time efficiency. These findings deepen our understanding of PMD-mean and illuminate pathways toward principled improvements in RL algorithms for LLMs. Code is available at https://github.com/horizon-rl/OpenKimi.
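
To make the approximation concrete, the following minimal sketch compares the scaled log-partition term with the mean reward for binary rewards. It assumes the standard KL-regularized closed form $\pi_{t+1}\propto\pi_t\exp(r/\tau)$, so that the term being approximated is $\tau\log\mathbb{E}_{y\sim\pi_t}[\exp(r(y)/\tau)]$. The success probability `p` and the function names are illustrative and are not taken from the paper's code.

```python
import math

def scaled_log_partition(p, tau):
    """tau * log E[exp(r / tau)] for r ~ Bernoulli(p), with 0 < p <= 1.

    Uses the identity
        log(1 - p + p * e^{1/tau}) = 1/tau + log(p + (1 - p) * e^{-1/tau})
    so that small tau does not overflow exp().
    """
    return 1.0 + tau * math.log(p + (1.0 - p) * math.exp(-1.0 / tau))

def mean_reward(p):
    """E[r] for r ~ Bernoulli(p): the quantity PMD-mean uses in place of the log-partition term."""
    return p

# Hypothetical grid of temperatures and success rates, for illustration only.
for tau in (0.05, 0.5, 5.0):
    for p in (0.01, 0.2):
        gap = scaled_log_partition(p, tau) - mean_reward(p)
        print(f"tau={tau:<5} p={p:<5} gap={gap:.4f}")
```

By Jensen's inequality the gap is nonnegative. It vanishes as $\tau\to\infty$, and for $p>0$ the scaled log-partition term approaches the maximum reward as $\tau\to 0$, so the two targets differ most away from the large-$\tau$ limit.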

Paper Structure (34 sections, 15 theorems, 134 equations, 8 figures, 4 tables)

Introduction
Preliminaries
Implicit Regularization of PMD-mean
Exact Solution of PMD-mean
PMD-mean as Mirror Descent with Mixed KL--$\chi^2$ Regularization
Implications on Convergence
One-Step Policy Improvement
Instantiation and Separation
Ideal Convergence Rate $\eta_t$
Log-ratio Bounds $(B,B_+)$ Compatible with Realizability
Target Estimation Error
Refined Analysis for PMD-mean
Experiments
Main Results
Related Work
...and 19 more sections

Key Result

Theorem 3.1

Assume $\pi_t(y)>0$ for all $y\in\mathcal{Y}$. Let $\Delta_y \coloneqq r(y) - \mathbb{E}_{y^\prime\sim \pi_t}[r(y^\prime)]$ denote the mean-baseline advantage. Then the unique minimizer of the PMD-mean objective (eq:L_mean) over the probability simplex satisfies […], where $W(\cdot)$ is the principal branch of the Lambert-$W$ function (the inverse of $f(w)=w\cdot e^w$) and $\lambda\ge 0$ is a normalization constant chosen such that …
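
As a hedged illustration of how a Lambert-$W$ update can arise, the sketch below assumes the objective is the squared regression loss $\tfrac12\,\mathbb{E}_{y\sim\pi_t}\big[(\log\tfrac{\pi(y)}{\pi_t(y)} - \Delta_y/\tau)^2\big]$, which is not stated in this excerpt. Under that assumption, stationarity on the simplex gives $\log\rho_y + \lambda\rho_y = \Delta_y/\tau$ for the ratio $\rho_y = \pi(y)/\pi_t(y)$, so $\rho_y = W(\lambda e^{\Delta_y/\tau})/\lambda$, with $\lambda$ fixed by normalization. All function names and the toy inputs are hypothetical, and the scaling of $\lambda$ may differ from the paper's.

```python
import math

def lambert_w0(x, tol=1e-14, max_iter=200):
    """Principal branch W(x) for x >= 0, via Newton's method on w * exp(w) = x."""
    if x < 0:
        raise ValueError("this sketch only handles x >= 0")
    w = math.log1p(x)  # lies at or above the root, so Newton descends monotonically
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) <= tol * (1.0 + abs(w)):
            break
    return w

def ratio(a, lam):
    """rho solving log(rho) + lam * rho = a, i.e. rho = W(lam * e^a) / lam."""
    if lam < 1e-12:  # lam -> 0 recovers the unregularized ratio e^a
        return math.exp(a)
    return lambert_w0(lam * math.exp(a)) / lam

def pmd_mean_update(pi_t, rewards, tau):
    """Return (pi_next, lam) under the assumed squared log-ratio regression loss."""
    mean_r = sum(p * r for p, r in zip(pi_t, rewards))
    a = [(r - mean_r) / tau for r in rewards]  # Delta_y / tau

    def total(lam):
        return sum(p * ratio(ai, lam) for p, ai in zip(pi_t, a))

    # total(lam) is decreasing in lam and total(0) >= 1 by Jensen's inequality,
    # so bracket the root and bisect.
    lo, hi = 0.0, 1.0
    while total(hi) > 1.0:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if total(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return [p * ratio(ai, lam) for p, ai in zip(pi_t, a)], lam

# Toy two-action example with a binary reward (illustrative values only).
pi_next, lam = pmd_mean_update([0.9, 0.1], [0.0, 1.0], tau=0.5)
print(pi_next, lam)
```

Solving for $\lambda$ by bisection is a design choice of this sketch; any one-dimensional root finder would serve, since the normalization sum is monotone in $\lambda$.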

Figures (8)

Figure 1: Left: Scaled log-partition function vs average reward assuming binary rewards. The gap is significant for moderate $\tau$. Right: Illustration of PMD-mean and PMD-part converging to different subproblem solutions in the probability simplex.
Figure 2: The (log) probability ratio of updates in PMD-mean is more conservative than that in PMD-part for binary rewards.
Figure 3: Target estimation error of PMD-mean and PMD-part with $\tau=0.05$ and $p_t$ ranging from $0.01$ to $0.2$. Left: the target estimation error $\overline{\Delta^2}$. Right: the scaled estimation error with the corresponding prefactor $e^{B_+}$ in \ref{eq:one_step_improvement}. The plot shows the average over $100$ random seeds. When the rollout sample size $n$ is small, the error of PMD-part is much larger for small $p_t$.
Figure 4: Training curves (smoothed) of PMD-mean (upper) and PMD-part (lower) with baselines for Qwen2.5-7B on DAPO-Math-17k (left) and the averaged evaluation accuracy on AIME 2024 and AIME 2025 (right). The global step of on-policy gradient is divided by 16 to match other algorithms.
Figure 5: The minimum of log-ratios $\log\frac{\pi_{t+1}}{\pi_t}$ in PMD-mean and PMD-part, estimated from the last update mini-batch.
...and 3 more figures

Theorems & Definitions (29)

Theorem 3.1: PMD-mean solution
Proposition 3.2: Equivalent mixed KL--$\chi^2$ subproblem
Remark 3.3: Connection to $\chi^2$ preference optimization
Remark 3.4: Policy ratios compared to Huang et al. (2024)
Lemma 4.5: Empirical minimization
Theorem 4.6: One-step policy improvement
Proposition 4.7: Ideal contraction for PMD-mean with small $\tau$
Proposition 4.8: Ideal contraction for PMD-part
Proposition 4.9: Log-ratios for PMD-mean with small $\tau$
Proposition 4.10: Log-ratios for PMD-part with small $\tau$
...and 19 more
