EMA Policy Gradient: Taming Reinforcement Learning for LLMs with EMA Anchor and Top-k KL
Lunjun Zhang, Jimmy Ba
TL;DR
This paper addresses instability and sample inefficiency in policy-gradient RL for large language models with two simple techniques: an EMA anchor, which replaces the fixed reference policy with an exponential moving average of the policy, $\theta_{\mathrm{ema}}$, and a Top-k KL estimator, which combines exact KL on the top-$k$ tokens with a sampled correction for the tail so that estimates remain unbiased at $O(k)$ memory. The authors argue for token-level KL over sequence-level KL in reasoning and agentic RL, and develop unbiased estimators such as $\mathrm{K3}^{++}$, $\mathrm{K4}$, and $\mathrm{K5}$, along with a Top-k framework that recovers exact KL at $k = |\mathcal{V}|$ and sampled KL at $k = 0$. They provide a theoretical stability analysis of EMA-PG dynamics and report gains over GRPO on math reasoning (OlympiadBench, 50.8% to 53.9%) and on agentic QA with search (including HotpotQA, 29.7% to 44.1%, 2WikiMultiHopQA, 27.4% to 40.1%, and Bamboogle). Overall, EMA-PG is presented as a simple, principled, and scalable approach to RL for LLMs, and the results suggest that carefully designed KL estimators and momentum-style anchors can yield meaningful gains with manageable memory overhead. Code is released on GitHub.
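The EMA anchor described above can be illustrated with a minimal sketch of the update $\theta_{\mathrm{ema}} \leftarrow \beta\,\theta_{\mathrm{ema}} + (1-\beta)\,\theta$. This assumes the average is taken over policy parameters, as with a target network; the function name `ema_update`, the `decay` argument (standing in for $\beta$), and the toy parameter dictionaries are hypothetical and are not taken from the authors' code.

```python
def ema_update(anchor, policy, decay=0.99):
    """Return a new anchor: decay * anchor + (1 - decay) * policy, per parameter.

    `anchor` and `policy` map parameter names to lists of floats.
    The decay value is an assumed hyperparameter, not one from the paper.
    """
    return {
        name: [decay * a + (1.0 - decay) * p
               for a, p in zip(anchor[name], policy[name])]
        for name in anchor
    }


# Toy example: the anchor drifts toward a policy that is held fixed here.
anchor = {"w": [0.0, 0.0]}
policy = {"w": [1.0, -2.0]}
for _ in range(3):
    anchor = ema_update(anchor, policy, decay=0.5)
print(anchor)  # {'w': [0.875, -1.75]}
```

In an actual training loop the policy would change between updates, so the anchor would trail it rather than converge to a fixed point; a decay closer to 1 makes the anchor move more slowly.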
Abstract
Reinforcement Learning (RL) has enabled Large Language Models (LLMs) to acquire increasingly complex reasoning and agentic behaviors. In this work, we propose two simple techniques to improve policy gradient algorithms for LLMs. First, we replace the fixed anchor policy during RL with an Exponential Moving Average (EMA), similar to a target network in deep Q-learning. Second, we introduce the Top-k KL estimator, which allows for flexible interpolation between exact KL and sampled KL. We derive the stability conditions for using an EMA anchor; moreover, we show that our Top-k KL estimator yields both unbiased KL values and unbiased gradients at any k, while bringing the benefits of exact KL. When combined with GRPO, the two techniques (EMA-PG) lead to a significant performance boost. On math reasoning, EMA-PG allows R1-distilled Qwen-1.5B to reach 53.9% on OlympiadBench, compared to 50.8% with GRPO. On agentic RL domains, with Qwen-3B base, EMA-PG improves GRPO by an average of 33.3% across 7 datasets of Q&A with search engines, including 29.7% $\rightarrow$ 44.1% on HotpotQA and 27.4% $\rightarrow$ 40.1% on 2WikiMultiHopQA. Overall, we show that EMA-PG is a simple, principled, and powerful approach to scaling RL for LLMs. Code: https://github.com/LunjunZhang/ema-pg
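To make the interpolation between exact and sampled KL concrete, here is a minimal sketch of a top-k estimator of the token-level KL between a policy distribution `p` and an anchor distribution `q`. It is an illustration under stated assumptions, not the authors' estimator: the tail term uses the plain log-ratio of one token sampled from `p`, the function names are hypothetical, and only unbiasedness of the value (not of the gradient) is shown.

```python
import math
import random


def exact_kl(p, q):
    """Exact KL(p || q) over the full vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def top_k_kl_estimate(p, q, k, sampled_token):
    """Exact KL terms on the k most likely tokens of p, plus a sampled tail term.

    `sampled_token` must be drawn from p. If it falls outside the top-k set,
    its log-ratio is added; in expectation this recovers the tail of the sum.
    """
    top = sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]
    head = sum(p[i] * math.log(p[i] / q[i]) for i in top if p[i] > 0.0)
    if sampled_token in set(top):
        return head
    return head + math.log(p[sampled_token] / q[sampled_token])


p = [0.5, 0.3, 0.15, 0.05]    # toy policy distribution (assumed)
q = [0.25, 0.25, 0.25, 0.25]  # toy anchor distribution (assumed)

# One stochastic estimate with k = 2.
token = random.choices(range(len(p)), weights=p)[0]
print(top_k_kl_estimate(p, q, 2, token))

# Averaging over every possible sample, weighted by p, matches the exact KL.
for k in range(len(p) + 1):
    expected = sum(p[y] * top_k_kl_estimate(p, q, k, y) for y in range(len(p)))
    print(k, expected, exact_kl(p, q))
```

With `k = len(p)` the estimate is the exact KL regardless of the sample, and with `k = 0` it reduces to the single-sample log-ratio, which matches the two endpoints the abstract describes.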
