Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL

Chenlu Ye, Xuanchang Zhang, Yifan Hao, Zhou Yu, Ziji Zhang, Abhinav Gullapalli, Hao Chen, Jing Huang, Tong Zhang

Abstract

Off-policy problems such as policy staleness and training-inference mismatch have become a major bottleneck for training stability and further exploration in LLM RL. Measures taken to enhance inference efficiency widen the distribution gap between the inference policy and the updated policy, leading to heavy-tailed importance ratios. Heavy-tailed ratios arise when the policy is locally sharp, which further inflates sharp gradients and can push updates outside the trust region. To address this, we propose Adaptive Layerwise Perturbation (ALP), which injects small learnable perturbations into the input hidden states of each layer during updates; the resulting perturbed policy is used as the numerator of the importance ratio against the unchanged inference policy in the objective. Intuitively, by adding controlled noise to intermediate representations, ALP prevents the updated policy from deviating too sharply from the inference policy and enlarges the policy family to cover the inference policy family with mismatch noise. Hence, the flattened distribution naturally tightens the gap between the updated and inference policies and reduces the tail of the importance ratios, thus maintaining training stability. This is further validated empirically. Experiments on single-turn math and multi-turn tool-integrated reasoning tasks show that ALP not only improves final performance but also avoids blow-up of the importance-ratio tail and KL spikes during iterative training, along with boosted exploration. Ablations show that representation-level perturbations across all layers are most effective, substantially outperforming partial-layer and logits-only variants.
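
As a rough illustration of the mechanism described in the abstract, the following minimal Python sketch adds Gaussian noise to the input hidden state of every layer of a toy network and uses the perturbed log-probability as the numerator of an importance ratio against a fixed inference log-probability. The tiny tanh network, the function names (`forward_logprobs`, `importance_ratio`), and the fixed per-layer noise scales `sigmas` are assumptions made for illustration only; the abstract does not say how the perturbations are parameterized or learned, so this should not be read as the authors' implementation.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def log_softmax(z):
    """Numerically stable log-softmax over a list of logits."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def forward_logprobs(layers, head, x, sigmas=None, rng=None):
    """Toy tanh network returning token log-probabilities.

    If `sigmas` is given, Gaussian noise with standard deviation sigmas[i]
    is added to the input hidden state of layer i (hypothetical stand-in
    for a layerwise perturbation).
    """
    h = list(x)
    for i, W in enumerate(layers):
        if sigmas is not None:
            h = [v + sigmas[i] * rng.gauss(0.0, 1.0) for v in h]
        h = [math.tanh(v) for v in matvec(W, h)]
    return log_softmax(matvec(head, h))

def importance_ratio(layers, head, x, token, logp_inference, sigmas, rng):
    """Perturbed training-policy probability over inference-policy probability."""
    logp_train = forward_logprobs(layers, head, x, sigmas, rng)[token]
    return math.exp(logp_train - logp_inference)

# Toy setup: 2 layers, hidden size 4, vocabulary of 5 tokens (all made up).
rng = random.Random(0)
dim, vocab = 4, 5
layers = [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
          for _ in range(2)]
head = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab)]
x = [0.5, -0.3, 0.8, 0.1]
token = 2

# Inference policy: same weights here, evaluated without perturbation.
logp_inf = forward_logprobs(layers, head, x)[token]

ratio_no_noise = importance_ratio(layers, head, x, token, logp_inf, [0.0, 0.0], rng)
ratio_perturbed = importance_ratio(layers, head, x, token, logp_inf, [0.05, 0.05], rng)
print(ratio_no_noise, ratio_perturbed)
```

In this sketch the training and inference weights are identical, so the ratio equals one when the noise scales are zero and moves away from one only through the injected perturbation. In an actual training run the inference log-probability would come from a separate rollout engine, and the noise scales would be learned rather than fixed.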

Paper Structure (40 sections, 5 theorems, 33 equations, 10 figures, 4 tables)

Key Result

Theorem 1

When perturbation is not too large to distort the original distribution, we have where $\mathcal{O}(\cdot)$ hides absolute constants and lower-order terms.

Figures (10)

  • Figure 1: Left: Visualization of Adaptive Layerwise Perturbation (ALP), where a small layerwise perturbation variable is added to the model during training to cover the layerwise bias between training and inference policies. Right: Comparison of full training with perturbation (ALP) and without perturbation (Bypass) across $1840$ training steps. Without perturbation, the policy distribution becomes more spiky and brittle, leading to blow-up tails. In contrast, ALP smooths policies, tightens the envelope, and stabilizes the importance ratio.
  • Figure 2: Benefits of perturbation on smoothness: (Left) Toy simulation to show how perturbation smooths a sharp, spiky objective into a flatter surrogate, reducing sensitivity to local sharp maxima and promoting progress toward broader optima. (Middle, Right) Multi-turn, one-iteration controlled comparison. Starting from the same checkpoint and same batch of rollout samples, we perform $16$ off-policy updating steps without perturbation (middle) and with perturbation (right). Perturbation markedly shrinks the mismatch envelope, especially for low-probability tokens, thereby reducing extreme training–inference deviations.
  • Figure 3: Single-turn training dynamics under training-inference mismatch. We report step-wise (policy-update steps) optimization diagnostics (batch-averaged reward, gradient norm, policy entropy, and KL divergence between rollout and training policies) for GRPO, MIS, Bypass, and ALP. The reward mean is smoothed with a 10-step moving average.
  • Figure 4: Training dynamics: The leftmost plot illustrates that Seq-ALP maintains higher entropy than MIS baselines, ensuring sufficient exploration, while avoiding the unstable entropy growth ("blow-up") observed in Seq-Bypass. The center and right plots show that Seq-ALP maintains consistent convergence in both KL metrics. In contrast, Token-MIS exhibits significant instability in Train-Inference KL, and Seq-MIS suffers from sharp spikes in Policy Update KL, highlighting the robustness of the ALP method.
  • Figure 5: Pass@k performance analysis on AIME 2024 and AIME 2025 datasets for TIR tasks. The results compare ALP against baseline methods across varying rollout numbers ($k$). ALP consistently achieves the highest accuracy in the range of $k=16 \sim 256$, indicating that adaptive latent perturbation significantly enhances the model's exploration efficiency and solution diversity.
  • ...and 5 more figures
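
The smoothing effect described in the Figure 2 caption can be imitated with a small toy computation. The sketch below is not the paper's simulation: the objective (a broad bump plus a narrow, taller spike), the noise scale `sigma`, and the grid-based Gaussian averaging are all assumptions chosen to show how averaging an objective over input noise can move the maximizer from a sharp local peak to a broader optimum.

```python
import math

def objective(x):
    """Hypothetical spiky objective: broad bump at x=2, narrow taller spike at x=-2."""
    broad = math.exp(-((x - 2.0) ** 2) / (2 * 1.0 ** 2))
    spike = 1.5 * math.exp(-((x + 2.0) ** 2) / (2 * 0.05 ** 2))
    return broad + spike

def smoothed(f, x, sigma, half_width=4.0, step=0.01):
    """Approximate E[f(x + sigma * eps)] for eps ~ N(0, 1).

    Uses a normalized Riemann sum over eps in [-half_width, half_width].
    """
    n = int(round(half_width / step))
    total, weight_sum = 0.0, 0.0
    for j in range(-n, n + 1):
        eps = j * step
        w = math.exp(-0.5 * eps * eps)  # unnormalized Gaussian weight
        total += w * f(x + sigma * eps)
        weight_sum += w
    return total / weight_sum

# Search both objectives on the same grid over [-4, 4].
grid = [i * 0.05 for i in range(-80, 81)]
x_raw = max(grid, key=objective)
x_smooth = max(grid, key=lambda x: smoothed(objective, x, sigma=0.5))
print(x_raw, x_smooth)
```

With these made-up settings the unsmoothed maximizer sits on the narrow spike, while the smoothed surrogate peaks at the broad bump, because convolving with Gaussian noise shrinks a narrow peak's height far more than a wide one's.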

Theorems & Definitions (8)

  • Theorem 1: Informal
  • Theorem 2
  • Remark 1: Token level and sequence level
  • Lemma 1: Stam's inequality
  • Lemma 2
  • proof
  • Theorem 3: Formal Version of Theorem 1
  • proof