BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning

Yuan Li, Bo Wang, Yufei Gao, Yuqian Yao, Xinyuan Wang, Zhangyue Yin, Xipeng Qiu

TL;DR

BandPO replaces canonical clipping with Band, a unified theoretical operator that projects trust regions defined by f-divergences into dynamic, probability-aware clipping intervals. The mapping is formulated as a convex optimization problem, which guarantees a globally optimal numerical solution and admits closed-form solutions for specific divergences.

Abstract

Proximal constraints are fundamental to the stability of reinforcement learning for Large Language Models. While the canonical clipping mechanism in PPO serves as an efficient surrogate for trust regions, we identify a critical bottleneck: fixed bounds strictly constrain the upward update margin of low-probability actions, disproportionately suppressing high-advantage tail strategies and inducing rapid entropy collapse. To address this, we introduce Band-constrained Policy Optimization (BandPO). BandPO replaces canonical clipping with Band, a unified theoretical operator that projects trust regions defined by f-divergences into dynamic, probability-aware clipping intervals. Theoretical analysis confirms that Band effectively resolves this exploration bottleneck. We formulate this mapping as a convex optimization problem, guaranteeing a globally optimal numerical solution while deriving closed-form solutions for specific divergences. Extensive experiments across diverse models and datasets demonstrate that BandPO consistently outperforms canonical clipping and Clip-Higher, while robustly mitigating entropy collapse.
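
To make the bottleneck concrete, the sketch below compares a fixed upper clip of 1 + 0.28 with ratio bounds induced by a KL trust region of radius 0.1 (the values quoted in the Figure 1 caption). It is a minimal illustration under stated assumptions, not the authors' implementation: the direction KL(old || new), the reduction to a two-outcome problem (the action versus everything else), the bisection solver, and the names `binary_kl` and `kl_band` are all hypothetical choices made here.

```python
import math

def binary_kl(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)) for 0 < p < 1; infinite when q is 0 or 1."""
    if q <= 0.0 or q >= 1.0:
        return math.inf
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def kl_band(p, delta, iters=100):
    """Return (r_lower, r_upper): ratios q / p on the boundary binary_kl(p, q) = delta.

    Assumption: the trust region is KL(old || new) and all other actions are
    rescaled by one shared factor, so only the scalar q = r * p matters.
    """
    def boundary(feasible, infeasible):
        # Bisection between a point inside the trust region and one outside it.
        for _ in range(iters):
            mid = 0.5 * (feasible + infeasible)
            if binary_kl(p, mid) <= delta:
                feasible = mid
            else:
                infeasible = mid
        return feasible

    q_low = boundary(p, 0.0)   # search downward from q = p
    q_up = boundary(p, 1.0)    # search upward from q = p
    return q_low / p, q_up / p

if __name__ == "__main__":
    for p in (0.01, 0.1, 0.5, 0.9):
        r_low, r_up = kl_band(p, delta=0.1)
        print(f"p={p:<5} fixed upper=1.28  band=[{r_low:.4f}, {r_up:.4f}]")
```

With a fixed bound, an action at probability 0.01 can rise to at most 0.0128 in one update. Under the assumptions above, the KL-induced upper ratio at p = 0.01 is well above 1.28, while at p = 0.9 it falls below 1.28 because the new probability cannot exceed 1.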

Paper Structure (49 sections, 8 theorems, 47 equations, 3 figures, 1 table)

Key Result

Lemma 1

Given a reference distribution $P \in \Delta^V$ with full support (i.e., $P(v)>0, \forall v$) and an action $a \in \mathcal{V}$, the optimal solution $Q^\star$ to the extremal problems (eq:upper_bound_opt) and (eq:lower_bound_opt) must strictly preserve the relative probability proportions within the complement set $\mathcal{V} \setminus \{a\}$, i.e., $Q^\star(v) = c \cdot P(v)$ for all $v \neq a$, where the scaling factor $c \in \mathbb{R}_{+}$ is uniquely determined by the simplex normalization $\sum_{v} Q^\star(v) = 1$.
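
The idea behind uniform complement rescaling can be checked numerically. For a fixed target probability on action $a$, spreading the remaining mass in proportion to $P$ should cost the least divergence, which leaves the largest budget for moving $Q(a)$ itself. The sketch below tests this for one small example. The choice of KL(P || Q) as the divergence, the example distribution, and the helper names are assumptions made for illustration only.

```python
import math
import random

def kl(p_dist, q_dist):
    """KL(P || Q) for two distributions with full support."""
    return sum(p * math.log(p / q) for p, q in zip(p_dist, q_dist))

def rescale_complement(p_dist, a, q_a):
    """Set Q(a) = q_a and scale every other entry of P by one shared factor c."""
    c = (1.0 - q_a) / (1.0 - p_dist[a])  # fixed by the simplex normalization
    return [q_a if v == a else c * p for v, p in enumerate(p_dist)]

def random_competitor(p_dist, a, q_a, rng):
    """A Q with the same Q(a) but an arbitrary split of the remaining mass."""
    weights = [0.0 if v == a else rng.random() + 1e-3 for v in range(len(p_dist))]
    total = sum(weights)
    return [q_a if v == a else (1.0 - q_a) * w / total for v, w in enumerate(weights)]

rng = random.Random(0)
P = [0.05, 0.50, 0.30, 0.15]   # hypothetical reference distribution
a, q_a = 0, 0.20               # raise action 0 from 0.05 to 0.20

Q_star = rescale_complement(P, a, q_a)
best = kl(P, Q_star)

# Smallest margin by which any random competitor exceeds the rescaled solution.
worst_gap = min(kl(P, random_competitor(P, a, q_a, rng)) - best for _ in range(1000))

# The full-vocabulary divergence collapses to a two-outcome (Bernoulli) divergence.
p_a = P[a]
binary = p_a * math.log(p_a / q_a) + (1 - p_a) * math.log((1 - p_a) / (1 - q_a))

print(f"KL at rescaled Q: {best:.6f}, binary KL: {binary:.6f}, min gap: {worst_gap:.6f}")
```

Under these assumptions, no random competitor achieves a lower divergence than the rescaled solution, and the divergence equals its two-outcome counterpart, which is what allows the vocabulary-sized problem to be reduced to a scalar one.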

Figures (3)

  • Figure 1: Comparison of clipping bounds between BandPO and baselines. (a) Comparison of ratio clipping regions: BandPO vs. DAPO. While DAPO enforces fixed asymmetric bounds ($\epsilon_+ = 0.28, \epsilon_- = 0.2$), BandPO projects a KL-induced trust region ($\delta = 0.1$) into dynamic bounds. The blue region highlights the expanded margin for low-probability, positive-advantage actions, effectively preventing premature saturation and preserving critical exploration gradients. (b) Comparison of the Bounds of Probability Variation. We visualize the bounds of variation derived from the Theoretical Simplex, DAPO, DCPO, and BandPO (ours). The symbols $\bar{r}$ and $\underline{r}$ denote the upper and lower clipping boundaries, respectively. Parameters are fixed at $\epsilon_+ = 0.28, \epsilon_- = 0.2$, and $\delta=0.1$. BandPO strictly adheres to physical simplex constraints while unlocking significant upward variation for low-probability actions.
  • Figure 2: Comparison of Probability Ratio Bounds. We visualize the ratio bounds derived from the Theoretical Simplex, DAPO, DCPO, and BandPO (ours). As $p \to 1$, BandPO bounds exhibit strict monotonicity: upper bounds decrease while lower bounds increase towards 1. Conversely, as $p \to 0$, the upper bounds of both DCPO and BandPO expand rapidly, effectively preventing premature clipping. Note that for TV and $\chi^2$, the radius $\delta=0.1$ triggers the simplex saturation condition, where the lower bounds explicitly align with the theoretical limit of $0$.
  • Figure 3: Comparison of training dynamics. (a) Overall clip rate measuring the fraction of clipped tokens relative to total tokens per update. (b) Proportion of clip-high for low-probability tokens ($p < 0.2$) relative to total clipped tokens, identifying erroneous tail-action suppression. (c) Evolution of policy entropy measuring the concentration of action distributions. Our method (red) effectively prevents mode collapse by mitigating the vanishing margin issue in (b).
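
The three diagnostics named in the Figure 3 caption can be computed from per-token statistics along the following lines. This is a hedged sketch rather than the paper's logging code: the function names, the list-based inputs, and the definition of a clipped token (ratio outside its bound in the direction favored by the advantage, as in standard PPO clipping) are assumptions.

```python
import math

def clip_diagnostics(ratios, advantages, old_probs, lower, upper, low_p=0.2):
    """Return (overall clip rate, share of clipped tokens that are low-p clip-high).

    lower and upper hold one bound per token, so fixed and
    probability-aware bounds can be passed in the same way.
    """
    clip_high = [r > u and adv > 0 for r, u, adv in zip(ratios, upper, advantages)]
    clip_low = [r < lo and adv < 0 for r, lo, adv in zip(ratios, lower, advantages)]
    n_clipped = sum(clip_high) + sum(clip_low)
    clip_rate = n_clipped / len(ratios)
    low_p_high = sum(h and p < low_p for h, p in zip(clip_high, old_probs))
    share = low_p_high / n_clipped if n_clipped else 0.0
    return clip_rate, share

def entropy(dist):
    """Shannon entropy (in nats) of one action distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

# Toy batch of four tokens with fixed bounds [0.8, 1.28].
ratios = [1.5, 1.1, 0.7, 1.4]
advantages = [1.0, 1.0, -1.0, 1.0]
old_probs = [0.05, 0.5, 0.3, 0.6]
rate, share = clip_diagnostics(ratios, advantages, old_probs, [0.8] * 4, [1.28] * 4)
```

In this toy batch three of four tokens are clipped, and one of those three is a low-probability token clipped on the upper side.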

Theorems & Definitions (16)

  • Lemma 1: Optimality of Uniform Complement Rescaling
  • Theorem 1: Exact Scalarization of Trust-Region Constraints
  • Proposition 1: Asymptotic Behavior of Band Bounds
  • Proposition 2: Strict Monotonicity of Band Bounds
  • Proposition 3: Constraint Saturation
  • Proposition 4: Closed-Form Band Bounds for TV and Pearson $\chi^2$
  • Proofs (4 entries)
  • ...and 6 more
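
For TV and Pearson $\chi^2$, the scalar reduction suggested by Lemma 1 leads to simple closed forms, sketched below. These expressions are derived here under assumed conventions, namely $\mathrm{TV}(Q, P) = \tfrac{1}{2}\sum_v |Q(v) - P(v)|$ and $\chi^2(Q \,\|\, P) = \sum_v (Q(v) - P(v))^2 / P(v)$, and may differ from the exact statement of Proposition 4. With complement rescaling, the constraints reduce to $|q - p| \le \delta$ and $(q - p)^2 / (p(1 - p)) \le \delta$ respectively, where $q$ is the new probability of the action. The function names are hypothetical.

```python
import math

def tv_band(p, delta):
    """Ratio bounds from |q - p| <= delta, clamped to the simplex (0 <= q <= 1)."""
    q_low = max(0.0, p - delta)
    q_up = min(1.0, p + delta)
    return q_low / p, q_up / p

def chi2_band(p, delta):
    """Ratio bounds from (q - p)^2 / (p * (1 - p)) <= delta, clamped to the simplex."""
    half_width = math.sqrt(delta * p * (1.0 - p))
    q_low = max(0.0, p - half_width)
    q_up = min(1.0, p + half_width)
    return q_low / p, q_up / p

if __name__ == "__main__":
    for p in (0.05, 0.5, 0.95):
        print(p, tv_band(p, 0.1), chi2_band(p, 0.1))
```

Under these conventions and $\delta = 0.1$, both lower bounds reach $0$ for small $p$ (for example $p = 0.05$), which matches the saturation behavior noted in the Figure 2 caption.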