Do Post-Training Algorithms Actually Differ? A Controlled Study Across Model Scales Uncovers Scale-Dependent Ranking Inversions

Xiaoyi Li

Abstract

Post-training alignment has produced dozens of competing algorithms -- DPO, SimPO, KTO, GRPO, and others -- yet practitioners lack controlled comparisons to guide algorithm selection. We present OXRL, a unified framework implementing 51 post-training algorithms with identical infrastructure, enabling the first large-scale apples-to-apples evaluation. Our study spans 8 algorithms across 4 model scales (0.5B--7B), 3 evaluation domains, and a 20-variant DPO taxonomy (100 runs at 1.5B, 5 seeds each), totaling $\sim$240 training runs on H100 GPUs. Three headline findings emerge. (1)~Algorithm rankings are unstable across scale: at 1.5B, online RL (SGRPO) tops all methods at 58.0\%~$\pm$0.57 on GSM8K; by 7B, the worst small-scale method (SimPO) becomes the best (85.8\%), a complete ranking inversion driven by model scale rather than LoRA regularization (confirmed via 2$\times$2 factorial). (2)~Loss function modifications yield negligible gains: none of 20 DPO variants significantly outperform vanilla DPO after Bonferroni correction; the sole significant outlier, SimPO, is worse ($-$11.5~pp, $p < 10^{-4}$). (3)~Algorithm leverage is task-specific: the 19.3~pp GSM8K spread collapses to 0.54~pp on MATH ($36\times$) and 0.47~pp on general-domain benchmarks ($41\times$), confirming that algorithm choice matters primarily within the training distribution. These findings yield a hierarchy of leverage for practitioners: model scale (${\sim}$50~pp) $\gg$ training paradigm (${\sim}$10~pp) $\gg$ online vs.\ offline (${\sim}$9~pp) $\gg$ loss function (${\sim}$1~pp). We release all code, configs, and evaluation data as a living community benchmark.
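
The collapse factors quoted in finding (3) follow from dividing the GSM8K spread by each of the other two spreads. A minimal Python check, using only the numbers stated in the abstract; the helper name `collapse_factor` is ours and purely illustrative:

```python
# Accuracy spreads (best minus worst algorithm) as quoted in the abstract,
# in percentage points (pp).
gsm8k_spread_pp = 19.3
math_spread_pp = 0.54
general_spread_pp = 0.47


def collapse_factor(reference_spread_pp: float, other_spread_pp: float) -> float:
    """Return how many times smaller `other_spread_pp` is than the reference."""
    if other_spread_pp <= 0:
        raise ValueError("spread must be positive")
    return reference_spread_pp / other_spread_pp


math_factor = collapse_factor(gsm8k_spread_pp, math_spread_pp)
general_factor = collapse_factor(gsm8k_spread_pp, general_spread_pp)

# Rounded to the nearest integer, these match the 36x and 41x in the abstract.
print(f"MATH: {math_factor:.1f}x, general-domain: {general_factor:.1f}x")
```

Both ratios are relative to the GSM8K spread, so they describe how much of the algorithm-to-algorithm variation disappears on the other benchmarks, not absolute accuracy.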

Paper Structure (20 sections, 1 equation, 4 figures, 15 tables)

Figures (4)

  • Figure 1: GSM8K accuracy across model scales. Dashed gray: base model. Error bars: $\pm$1$\sigma$ where multi-seed data is available. The V-shaped trajectories from 3B to 7B visualize the scale-driven ranking inversion; DPO's error bar at 3B reflects genuine seed variance ($\sigma = 2.01$ pp). At 1.5B, SGRPO achieves the highest accuracy (58.0%). Note: the 0.5B--3B models use full fine-tuning; the 7B model uses LoRA.
  • Figure 2: GSM8K accuracy across 5 seeds for 20 DPO variants at 1.5B (100 runs total). Dashed line: vanilla DPO mean (49.76%). Error bars: $\pm$1$\sigma$. $^\dagger$SimPO is the only variant significantly different from DPO ($p < 0.0026$ after Bonferroni correction), and it is worse.
  • Figure 3: Training dynamics comparison at 0.5B scale. Left: training loss over steps (smoothed). Right: validation loss over epochs. DPO converges to a lower loss; SimPO maintains a higher, noisier training loss and a flat validation loss.
  • Figure 4: Training determinism at 0.5B: per-seed GSM8K accuracy for each algorithm across 3 seeds. Five of six algorithms produce identical accuracy ($\sigma = 0$). Only IPO shows variance ($\sigma = 0.27$).
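
The $p < 0.0026$ figure in the Figure 2 caption is consistent with a Bonferroni-corrected threshold at a family-wise $\alpha = 0.05$ over 19 comparisons (each non-vanilla variant against vanilla DPO). Both the $\alpha$ and the count of 19 are assumptions inferred from the caption, not values it states. Under those assumptions, the threshold can be sketched as follows; the function names and the second p-value are hypothetical placeholders:

```python
def bonferroni_threshold(alpha: float, num_comparisons: int) -> float:
    """Per-comparison significance threshold under Bonferroni correction."""
    if num_comparisons < 1:
        raise ValueError("need at least one comparison")
    return alpha / num_comparisons


def is_significant(p_value: float, threshold: float) -> bool:
    """A comparison is significant if its raw p-value is below the threshold."""
    return p_value < threshold


ALPHA = 0.05           # assumed family-wise error rate
NUM_COMPARISONS = 19   # assumed: 20 DPO variants minus the vanilla baseline

threshold = bonferroni_threshold(ALPHA, NUM_COMPARISONS)

# The abstract reports p < 1e-4 for SimPO; 1e-4 is used here as an upper bound.
simpo_p_upper_bound = 1e-4
# Placeholder p-value for illustration only; it is not a reported result.
hypothetical_variant_p = 0.01

print(f"threshold = {threshold:.4f}")
print("SimPO significant:", is_significant(simpo_p_upper_bound, threshold))
print("placeholder variant significant:", is_significant(hypothetical_variant_p, threshold))
```

A raw p-value of 0.01 would pass an uncorrected 0.05 test but fails the corrected threshold, which is why a variant that looks better than vanilla DPO on a single comparison may not survive the correction.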