Clip Your Sequences Fairly: Enforcing Length Fairness for Sequence-Level RL

Hanyi Mao, Quanjia Xiao, Lei Pang, Haixiao Liu

TL;DR

This work addresses length-dependent bias in sequence-level RL for LLMs caused by fixed clipping of the sequence-level log-IS ratio $S(y|x)$. It introduces FSPO, a log-space clipping method with a $\sqrt{L}$-scaled band $b_L$ that achieves length fairness and preserves IS semantics. The authors formalize Length Reweighting Error (LRE), prove a cosine-direction guarantee between clipped and true updates under mild assumptions, and show $S_L$ is asymptotically Gaussian, supporting the clipping design. Empirically, FSPO yields flatter length-wise clip rates, more stable learning dynamics, and superior performance on MATH500, AIME24, and AIME25, with the largest gains on the 8B model, validating its practical impact for RLVR in LLMs.

Abstract

We propose FSPO (Fair Sequence Policy Optimization), a sequence-level reinforcement learning method for LLMs that enforces length-fair clipping on the importance-sampling (IS) weight. We study RL methods with sequence-level IS and identify a mismatch when PPO/GRPO-style clipping is transplanted to sequences: a fixed clip range systematically reweights short vs. long responses, distorting the optimization direction. FSPO introduces a simple remedy: we clip the sequence log-IS ratio with a band that scales as $\sqrt{L}$. Theoretically, we formalize length fairness via a Length Reweighting Error (LRE) and prove that small LRE yields a cosine directional guarantee between the clipped and true updates. Empirically, FSPO flattens clip rates across length bins, stabilizes training, and outperforms baselines across model sizes and evaluation datasets, with the largest gains on the Qwen3-8B-Base model.
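
The contrast between a fixed clip range and a band that scales as $\sqrt{L}$ can be illustrated with a minimal sketch. This is not the authors' implementation: the symmetric band $c\sqrt{L}$, the coefficient `c`, the threshold `xi`, the per-token noise level `sigma`, and all function names are assumptions made for illustration. It simulates per-token log-ratio differences as independent zero-mean Gaussian noise and counts how often each rule clips at different lengths.

```python
import math
import random


def sequence_log_ratio(logp_new, logp_old):
    """Sequence-level log-IS ratio: sum of per-token log-prob differences."""
    return sum(n - o for n, o in zip(logp_new, logp_old))


def clip_log_ratio_fixed(s, xi):
    """Clip the log ratio to a fixed band [-xi, xi], regardless of length."""
    return max(-xi, min(xi, s))


def clip_log_ratio_sqrt_len(s, length, c):
    """Clip to [-c*sqrt(L), c*sqrt(L)]; `c` is a hypothetical band coefficient."""
    b = c * math.sqrt(length)
    return max(-b, min(b, s))


def simulated_clip_rates(lengths=(50, 200, 800), n=500, sigma=0.02,
                         xi=0.3, c=0.04, seed=0):
    """Return {L: (fixed_band_clip_rate, sqrt_band_clip_rate)}.

    Assumption: token-level log-ratio differences are i.i.d. N(0, sigma^2),
    so the sequence-level sum has standard deviation sigma * sqrt(L).
    """
    rng = random.Random(seed)
    rates = {}
    for length in lengths:
        fixed_hits = 0
        sqrt_hits = 0
        for _ in range(n):
            logp_old = [0.0] * length
            logp_new = [rng.gauss(0.0, sigma) for _ in range(length)]
            s = sequence_log_ratio(logp_new, logp_old)
            # A sample counts as clipped when clipping changes its value.
            fixed_hits += clip_log_ratio_fixed(s, xi) != s
            sqrt_hits += clip_log_ratio_sqrt_len(s, length, c) != s
        rates[length] = (fixed_hits / n, sqrt_hits / n)
    return rates


if __name__ == "__main__":
    for length, (fixed, scaled) in simulated_clip_rates().items():
        print(f"L={length:4d}  fixed-band clip rate={fixed:.3f}  "
              f"sqrt-band clip rate={scaled:.3f}")
```

Under these assumptions the fixed band clips long sequences far more often than short ones, while the $\sqrt{L}$-scaled band clips at a roughly constant rate across lengths. The clipped importance weight would then be `math.exp` of the clipped log ratio.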

Paper Structure

This paper contains 30 sections, 2 theorems, 29 equations, 5 figures, 5 tables.

Key Result

Theorem 2.1

Under Assumptions \ref{['ass:weak-strat', 'ass:cov']},

Figures (5)

  • Figure 1: Empirical analysis of the sequence-level IS ratio. Sample size $n=217{,}454$. Left: Empirical distribution of $S_L$. The yellow line shows the empirical mean and the shaded band shows $\pm 2$ empirical standard deviations, computed with a bin size of 200 (see Appendix \ref{['app:angle-proof']} for justification of binning). Right: Q–Q plot testing normality. The sorted sample quantiles are shown as blue dots. We report the slope $m$, intercept $b$, and $R^2$ of the fitted line.
  • Figure 2: Theoretical and empirical clip fraction. Left: Theoretical clip probability $c(L)$ computed from \ref{['eq:c_rloo', 'eq:c_gspo']} using the hyperparameters in Appendix \ref{['app:hyperparam_tuning']}, where we set $\xi=\log(1+c_{\text{upper}})$. Right: Observed clip fraction $\hat{c}(L)$ with bin size $=200$, collected from the experiments on Qwen3-8B-Base model.
  • Figure 3: Learning dynamics during training. Left column: mean reward (1.7B and 8B). Right column: mean response length (1.7B and 8B). Reward curves are smoothed with EMA for visualization.
  • Figure 4: Overlong rate and mean response length on AIME24. We plot evaluation-time sampling; the x-axis orders the 30 problems from easy to hard, where difficulty is measured by the overall average accuracy across the four methods.
  • Figure 5: Ablation: fixed larger clip range. Left: mean rewards during training. Right: validation curves during training.
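
The Q–Q diagnostic described for Figure 1 (slope $m$, intercept $b$, and $R^2$ of a fitted line) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' analysis code: the standardization step, the plotting positions $(i+0.5)/n$, and the function name `qq_fit` are assumptions.

```python
import math
import random
from statistics import NormalDist


def qq_fit(samples):
    """Fit a least-squares line to a normal Q-Q plot.

    Returns (slope, intercept, r_squared) for sorted standardized samples
    regressed on standard-normal theoretical quantiles.
    """
    n = len(samples)
    mean = sum(samples) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in samples) / (n - 1))
    # Empirical quantiles: sorted, standardized observations.
    ys = sorted((x - mean) / std for x in samples)
    # Theoretical quantiles at plotting positions (i + 0.5) / n (an assumed choice).
    normal = NormalDist()
    xs = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]

    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


if __name__ == "__main__":
    rng = random.Random(1)
    gaussian = [rng.gauss(0.3, 2.0) for _ in range(5000)]
    skewed = [rng.expovariate(1.0) for _ in range(5000)]
    print("gaussian:", qq_fit(gaussian))
    print("skewed:  ", qq_fit(skewed))
```

For approximately Gaussian data the fit should give a slope near 1, an intercept near 0, and $R^2$ close to 1; a skewed sample such as the exponential one yields a visibly lower $R^2$.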

Theorems & Definitions (3)

  • Definition 2.1: Length Reweighting Error (LRE)
  • Theorem 2.1: Directional guarantee under length fairness
  • Theorem 3.1: Gaussianity of the sequence-level log ratio
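
If the sequence-level log ratio is approximately Gaussian, as Theorem 3.1 states asymptotically, the clip probability as a function of length has a closed form. The sketch below makes a simplifying assumption that is not taken from the paper: $S_L \sim \mathcal{N}(0, \sigma^2 L)$ with zero mean and a hypothetical per-token scale `sigma`. It does not reproduce the paper's own expressions for $c(L)$. It compares a fixed band on $S_L$, a fixed band on the length-normalized ratio $S_L / L$, and a band of width $c\sqrt{L}$; all function names and parameter values are illustrative.

```python
import math


def normal_two_sided_tail(z):
    """P(|Z| > z) for a standard normal Z and z >= 0."""
    return math.erfc(z / math.sqrt(2.0))


def clip_prob_fixed_band(length, sigma, xi):
    """P(|S_L| > xi) when S_L ~ N(0, sigma^2 * L): grows with L."""
    return normal_two_sided_tail(xi / (sigma * math.sqrt(length)))


def clip_prob_length_normalized(length, sigma, xi):
    """P(|S_L / L| > xi) when S_L ~ N(0, sigma^2 * L): shrinks with L."""
    return normal_two_sided_tail(xi * math.sqrt(length) / sigma)


def clip_prob_sqrt_band(length, sigma, c):
    """P(|S_L| > c * sqrt(L)) = P(|Z| > c / sigma): does not depend on L."""
    band = c * math.sqrt(length)
    return normal_two_sided_tail(band / (sigma * math.sqrt(length)))


if __name__ == "__main__":
    sigma = 0.02  # hypothetical per-token standard deviation
    for length in (50, 200, 800, 3200):
        print(
            f"L={length:5d}",
            f"fixed={clip_prob_fixed_band(length, sigma, xi=0.3):.4f}",
            f"normalized={clip_prob_length_normalized(length, sigma, xi=0.002):.4f}",
            f"sqrt={clip_prob_sqrt_band(length, sigma, c=0.04):.4f}",
        )
```

Under this assumed model, the fixed band clips long responses more often, the length-normalized band clips short responses more often, and only the $\sqrt{L}$-scaled band gives a length-independent clip probability (about 0.0455 when $c = 2\sigma$).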