Do Post-Training Algorithms Actually Differ? A Controlled Study Across Model Scales Uncovers Scale-Dependent Ranking Inversions

Xiaoyi Li

Abstract

Post-training alignment has produced dozens of competing algorithms (DPO, SimPO, KTO, GRPO, and others), yet practitioners lack controlled comparisons to guide algorithm selection. We present OXRL, a unified framework implementing 51 post-training algorithms with identical infrastructure, enabling the first large-scale apples-to-apples evaluation. Our study spans 8 algorithms across 4 model scales (0.5B--7B), 3 evaluation domains, and a 20-variant DPO taxonomy (100 runs at 1.5B, 5 seeds each), totaling ${\sim}$240 training runs on H100 GPUs. Three headline findings emerge. (1)~Algorithm rankings are unstable across scale: at 1.5B, online RL (SGRPO) tops all methods at 58.0\%~$\pm$0.57 on GSM8K; by 7B, the worst small-scale method (SimPO) becomes the best (85.8\%), a complete ranking inversion driven by model scale rather than LoRA regularization (confirmed via a 2$\times$2 factorial design). (2)~Loss function modifications yield negligible gains: none of the 20 DPO variants significantly outperforms vanilla DPO after Bonferroni correction; the sole significant outlier, SimPO, is worse ($-$11.5~pp, $p < 10^{-4}$). (3)~Algorithm leverage is task-specific: the 19.3~pp GSM8K spread collapses to 0.54~pp on MATH ($36\times$ smaller) and 0.47~pp on general-domain benchmarks ($41\times$ smaller), confirming that algorithm choice matters primarily within the training distribution. These findings yield a hierarchy of leverage for practitioners: model scale (${\sim}$50~pp) $\gg$ training paradigm (${\sim}$10~pp) $\gg$ online vs.\ offline (${\sim}$9~pp) $\gg$ loss function (${\sim}$1~pp). We release all code, configs, and evaluation data as a living community benchmark.
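The collapse factors quoted in finding (3) can be checked with a few lines of arithmetic. The sketch below is illustrative only: the three spread values are copied from the abstract, while the helper name `collapse_ratio` is hypothetical and not part of OXRL.

```python
# Spread between best and worst algorithm, in percentage points (from the abstract).
gsm8k_spread = 19.3      # in-distribution benchmark
math_spread = 0.54       # MATH
general_spread = 0.47    # general-domain benchmarks


def collapse_ratio(in_domain_pp: float, out_of_domain_pp: float) -> float:
    """Return how many times smaller the out-of-domain spread is."""
    return in_domain_pp / out_of_domain_pp


math_ratio = collapse_ratio(gsm8k_spread, math_spread)
general_ratio = collapse_ratio(gsm8k_spread, general_spread)

print(f"MATH: {math_ratio:.1f}x, general-domain: {general_ratio:.1f}x")
```

Rounded to the nearest integer, the two ratios match the $36\times$ and $41\times$ figures reported above.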

Paper Structure (20 sections, 1 equation, 4 figures, 15 tables)


Introduction
Related Work
Methods
Framework Design
Algorithm Taxonomy
Evaluation Protocol
Experimental Setup
Experiments and Results
Core Algorithm Comparison Across Scales
DPO Variant Taxonomy
Online RL vs. Offline Preference
Discussion
Conclusion
Full DPO Variant Results
Training Configurations
...and 5 more sections

Figures (4)

Figure 1: GSM8K accuracy across model scales. Dashed gray: base model. Error bars: $\pm$1$\sigma$ where multi-seed data is available. The V-shaped trajectories from 3B to 7B visualize the scale-driven ranking inversion; DPO's error bar at 3B reflects genuine seed variance ($\sigma = 2.01$ pp). At 1.5B, SGRPO achieves the highest accuracy (58.0%). Note: 0.5B--3B use full FT; 7B uses LoRA.
Figure 2: GSM8K accuracy across 5 seeds for 20 DPO variants at 1.5B (100 runs total). Dashed line: vanilla DPO mean (49.76%). Error bars: $\pm$1 std. $^\dagger$SimPO is the only variant that differs significantly from DPO (Bonferroni-corrected threshold $p < 0.0026$), and it is worse.
Figure 3: Training dynamics comparison at 0.5B scale. Left: training loss over steps (smoothed). Right: validation loss over epochs. DPO converges to lower loss; SimPO maintains higher, noisier loss with flat validation loss.
Figure 4: Training determinism at 0.5B: per-seed GSM8K accuracy for each algorithm across 3 seeds. Five of six algorithms produce identical accuracy ($\sigma = 0$). Only IPO shows variance ($\sigma = 0.27$).
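The $p < 0.0026$ threshold in the Figure 2 caption is consistent with a standard Bonferroni correction, as the sketch below illustrates. Two values here are assumptions not stated in the source: a family-wise level of $\alpha = 0.05$, and 19 comparisons (each of the 20 variants except vanilla DPO tested against the DPO baseline). With 20 comparisons the threshold would instead be 0.0025.

```python
# Assumed family-wise error rate; the source does not state alpha explicitly.
alpha = 0.05

# Assumption: 20 DPO variants, minus the vanilla DPO baseline itself.
n_comparisons = 19

# Bonferroni correction: each individual test must clear alpha / m.
threshold = alpha / n_comparisons

# SimPO's reported p-value bound from the abstract (p < 1e-4).
simpo_p_upper_bound = 1e-4
simpo_significant = simpo_p_upper_bound < threshold

print(f"per-comparison threshold: {threshold:.4f}")
print(f"SimPO significant after correction: {simpo_significant}")
```

Under these assumptions the per-comparison threshold rounds to 0.0026, and SimPO's reported bound of $p < 10^{-4}$ clears it comfortably.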
