EMA Policy Gradient: Taming Reinforcement Learning for LLMs with EMA Anchor and Top-k KL

Lunjun Zhang, Jimmy Ba

TL;DR

This paper addresses instability and sample inefficiency in policy-gradient RL for large language models with two simple techniques: an EMA anchor, which replaces the fixed reference policy with a smoothed target $\theta_{\mathrm{ema}}$, and a Top-k KL estimator, which combines exact KL on the top-$k$ tokens with a tail-sampled correction so that estimates stay unbiased at $O(k)$ memory. The authors argue for token-level KL over sequence-level KL in reasoning and agentic RL, and develop unbiased estimators such as $\mathrm{K3}^{++}$, $\mathrm{K4}$, and $\mathrm{K5}$, along with a Top-k framework that recovers exact KL at $k = |\mathcal{V}|$ and sampled KL at $k = 0$. They provide a theoretical stability analysis of EMA-PG dynamics and report improvements on math reasoning (OlympiadBench) and agentic QA tasks with search (HotpotQA, 2WikiMultiHopQA, Bamboogle). Overall, EMA-PG is presented as a simple, principled, and scalable approach to RL for LLMs, with code available at the project URL. The work suggests that carefully designed KL estimators and momentum-style anchors can yield meaningful performance gains with manageable memory overhead in real-world LLM RL settings.

Abstract

Reinforcement Learning (RL) has enabled Large Language Models (LLMs) to acquire increasingly complex reasoning and agentic behaviors. In this work, we propose two simple techniques to improve policy gradient algorithms for LLMs. First, we replace the fixed anchor policy during RL with an Exponential Moving Average (EMA), similar to a target network in deep Q-learning. Second, we introduce the Top-k KL estimator, which allows for flexible interpolation between exact KL and sampled KL. We derive the stability conditions for using an EMA anchor; moreover, we show that our Top-k KL estimator yields both unbiased KL values and unbiased gradients at any k, while bringing the benefits of exact KL. When combined with GRPO, the two techniques (EMA-PG) lead to a significant performance boost. On math reasoning, it allows R1-distilled Qwen-1.5B to reach 53.9% on OlympiadBench compared to 50.8% by GRPO. On agentic RL domains, with Qwen-3B base, EMA-PG improves GRPO by an average of 33.3% across 7 datasets of Q&A with search engines, including 29.7% $\rightarrow$ 44.1% on HotpotQA and 27.4% $\rightarrow$ 40.1% on 2WikiMultiHopQA. Overall, we show that EMA-PG is a simple, principled, and powerful approach to scaling RL for LLMs. Code: https://github.com/LunjunZhang/ema-pg
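
The EMA anchor described in the abstract can be illustrated with a minimal sketch. The function name `ema_update`, the flat-list parameter representation, and the decay values are assumptions made for illustration; the update schedule actually used in the paper is not given in this summary.

```python
def ema_update(anchor, policy, decay=0.99):
    """Move the anchor parameters a small step toward the current policy.

    anchor, policy: flat lists of floats (a stand-in for model weights).
    decay: hypothetical smoothing factor; values closer to 1 give a slower anchor.
    """
    if len(anchor) != len(policy):
        raise ValueError("anchor and policy must have the same length")
    return [decay * a + (1.0 - decay) * p for a, p in zip(anchor, policy)]


# Toy usage: with a frozen policy, the anchor converges geometrically toward it.
anchor = [0.0, 0.0]
policy = [1.0, -2.0]
for _ in range(3):
    anchor = ema_update(anchor, policy, decay=0.5)
print(anchor)
```

In an RL loop, this update would typically run after each policy-gradient step, so that the KL penalty is measured against a slowly moving target rather than a fixed reference.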

Paper Structure

This paper contains 108 sections, 6 theorems, 228 equations, 11 figures, 10 tables, 3 algorithms.

Key Result

Lemma 4.1

Quasi-steady state: Assume that the gradient and the Fisher matrix change extremely slowly over $k$ steps, i.e. that they remain approximately constant for $\Delta = 0, \ldots, k-1$. Then, with the matrices $D$, $S_k$, $M_k$ as defined in the paper, we have: (i) $I-D$ is invertible, with $S_{k} = (I-D^{k})(I-D)^{-1}$ and $M_{k} = (kI - S_{k})(I-D)^{-1}$; (ii) the dynamics of EMA Policy Gradient admit a closed form.
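
The two closed forms in part (i) are the standard geometric-sum identities. As a hedged sketch, assume that $S_k$ is the partial geometric sum of $D$ and that $M_k$ is the running sum of the $S_j$. These definitions are assumptions chosen to be consistent with the stated closed forms; the paper's own definitions are not reproduced in this summary.

$$
S_k = \sum_{i=0}^{k-1} D^{i}
\;\Longrightarrow\;
(I-D)\,S_k = \sum_{i=0}^{k-1} D^{i} - \sum_{i=1}^{k} D^{i} = I - D^{k}
\;\Longrightarrow\;
S_k = (I-D^{k})(I-D)^{-1},
$$

$$
M_k = \sum_{j=0}^{k-1} S_j
= \sum_{j=0}^{k-1} (I-D^{j})(I-D)^{-1}
= \Big(kI - \sum_{j=0}^{k-1} D^{j}\Big)(I-D)^{-1}
= (kI - S_k)(I-D)^{-1}.
$$

Both steps use that $I-D$ is invertible and that all matrices involved are polynomials in $D$, so they commute.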

Figures (11)

  • Figure 1: Using EMA anchor policy and Top-$k$ KL estimator significantly improves the performance of RL algorithms like GRPO.
  • Figure 2: Top-k KL computes exact KL on the top-k indices, and adds a masked sampled KL to keep the estimator unbiased.
  • Figure 3: On math reasoning datasets, using the EMA anchor improves both Pass@1 and Pass@N compared to GRPO. Results were obtained with the 1.5B model DeepSeek-R1-Distill-Qwen-1.5B.
  • Figure 4: Bias-variance tradeoff in the Top-k KL estimator: Top-$k$ KL (unbiased) has lower gradient error than sampled KL (unbiased) in all regimes, but only outperforms Truncated KL (biased) beyond a certain critical sample size; we illustrate this bias-variance tradeoff in a synthetic setting (see the paper's appendix). Rule of thumb: when using a small $k$, apply the tail correction (default); for a large $k$, consider Truncated KL.
  • Figure 5: EMA-PG not only learns faster but also reaches higher asymptotic performance on agentic tasks of Q&A with a search engine. Using the Top-k KL estimator, forward KL learns slightly faster than reverse KL. Both forward and reverse Top-k KL significantly outperform GRPO (and GRPO with EMA anchor). This shows that the combination of EMA anchor and Top-k KL is highly effective.
  • ...and 6 more figures
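
The idea in the Figure 2 caption (exact KL on the top-k indices plus a masked sampled term) can be sketched on a toy categorical distribution. This is an illustrative sketch, not the paper's implementation: the function names are hypothetical, the tail term uses the plain log-ratio of a sampled token (the paper's tail correction may take a different form), and only unbiasedness of the KL value is checked, not of the gradient.

```python
import math


def exact_kl(p, q):
    """Exact KL(p || q) over the full vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def topk_kl_estimate(p, q, sampled_token, k):
    """One-sample Top-k KL estimate, where sampled_token is drawn from p.

    Head: exact KL terms on the k most likely tokens under p.
    Tail: the sampled log-ratio, masked out if the sample falls in the top-k.
    """
    order = sorted(range(len(p)), key=lambda i: p[i], reverse=True)
    top = set(order[:k])
    head = sum(p[i] * math.log(p[i] / q[i]) for i in top if p[i] > 0.0)
    tail = 0.0
    if sampled_token not in top:
        tail = math.log(p[sampled_token] / q[sampled_token])
    return head + tail


def expected_estimate(p, q, k):
    """Expectation of the estimator over sampled_token ~ p, by enumeration."""
    return sum(p[x] * topk_kl_estimate(p, q, x, k)
               for x in range(len(p)) if p[x] > 0.0)


p = [0.5, 0.3, 0.15, 0.05]
q = [0.4, 0.3, 0.2, 0.1]
for k in range(len(p) + 1):
    # The gap should be zero up to floating-point error for every k.
    print(k, abs(expected_estimate(p, q, k) - exact_kl(p, q)))
```

Setting `k = 0` reduces the estimate to a purely sampled log-ratio, and `k = len(p)` reduces it to the exact KL, which mirrors the interpolation described in the abstract.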

Theorems & Definitions (10)

  • Lemma 4.1
  • Lemma 4.2
  • Lemma 3.1: Lag dynamics
  • proof
  • Proposition 9.1
  • proof
  • Proposition 9.2
  • proof
  • Lemma 10.1
  • proof