Stackelberg Self-Annotation: A Robust Approach to Data-Efficient LLM Alignment

Xu Chu, Zhixin Zhang, Tianyu Jia, Yujie Jin

TL;DR

This work tackles the data-efficiency challenge in aligning LLMs with human preferences by modeling preference optimization as a Stackelberg game (SGPO) between a policy leader and a worst-case follower within an $\epsilon$-Wasserstein ball. It proves the existence of a Stackelberg equilibrium, establishes $\mathcal{O}(\epsilon)$-bounded regret under distributional shifts, and contrasts this robustness with standard DPO. The authors instantiate SGPO as SSAPO, which bootstraps from a small seed of human labels, self-annotates prompts, and employs a distributionally robust reweighting to mitigate noise, achieving strong performance with only about 1/30 of typical human annotations. Empirically, SSAPO delivers competitive or superior results on AlpacaEval 2.0 and MT-Bench across multiple models and remains robust to noise in the seed labels, highlighting a practical path to data-efficient LLM alignment at scale.

Abstract

Aligning large language models (LLMs) with human preferences typically demands vast amounts of meticulously curated data, which is both expensive and prone to labeling noise. We propose Stackelberg Game Preference Optimization (SGPO), a robust alignment framework that models alignment as a two-player Stackelberg game between a policy (leader) and a worst-case preference distribution (follower). The proposed SGPO guarantees $\mathcal{O}(\epsilon)$-bounded regret within an $\epsilon$-Wasserstein ball, offering formal robustness to (self-)annotation noise. We instantiate SGPO with Stackelberg Self-Annotated Preference Optimization (SSAPO), which uses minimal human-labeled "seed" preferences and iteratively self-annotates new prompts. In each iteration, SSAPO applies a distributionally robust reweighting of synthetic annotations, ensuring that noisy or biased self-labels do not derail training. Remarkably, using only 2K seed preferences -- about 1/30 of standard human labels -- SSAPO achieves strong win rates against GPT-4 across multiple benchmarks within three iterations. These results highlight that a principled Stackelberg formulation yields data-efficient alignment for LLMs, significantly reducing reliance on costly human annotations.
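
One way to read the leader-follower setup is as a max-min problem. The display below is a generic sketch, not necessarily the paper's exact objective: the symbols $\hat{P}$ (an empirical preference distribution), $W$ (a Wasserstein distance), $r_\pi$ (an implicit reward induced by the policy), and $\sigma$ (the logistic function) are notation assumed here for illustration.

$$
\max_{\pi}\; \min_{P \,:\, W(P, \hat{P}) \le \epsilon}\; \mathbb{E}_{(x, y_w, y_l) \sim P}\Big[\log \sigma\big(r_\pi(x, y_w) - r_\pi(x, y_l)\big)\Big]
$$

Under this reading, the leader chooses $\pi$ while anticipating the follower's worst-case $P$ inside the $\epsilon$-ball, and the $\mathcal{O}(\epsilon)$ regret guarantee concerns preference distributions that stay within that ball.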

Paper Structure

This paper contains 92 sections, 26 theorems, 84 equations, 2 figures, 5 tables, 1 algorithm.

Key Result

Theorem 2.2

Under the assumptions above, the robust SGPO problem (eq:sgpo_robust_obj) admits at least one solution $(\pi^*,\alpha^*)$.

Figures (2)

  • Figure 1: SSAPO workflow. We maintain a large prompt pool and a small set of seed-labeled preferences. The policy self-annotates new prompts by generating and ranking responses, thus expanding the preference database. A follower then identifies a worst-case distribution for these preferences, and the leader (policy) is updated accordingly. This process repeats over multiple iterations.
  • Figure 2: Improvement during iterations. Evaluation results on AlpacaEval 2.0 for the initial DPO stage and each iteration; the results of the SFT model are from Kim2025Spread.
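
The workflow described in the Figure 1 caption can be sketched as a toy loop. This is a minimal illustration under strong assumptions, not the authors' implementation: the one-parameter "policy" (`theta`), the length-based `score`, and every function name are hypothetical, and the follower uses exponential tilting of per-pair losses as a simple stand-in for the Wasserstein worst-case distribution.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(theta, response):
    # Hypothetical implicit reward: one parameter times a toy feature (length).
    return theta * len(response)

def self_annotate(theta, candidates):
    # Rank candidate responses with the current policy; keep (best, worst).
    ranked = sorted(candidates, key=lambda y: score(theta, y), reverse=True)
    return ranked[0], ranked[-1]

def pair_losses(theta, pairs):
    # Logistic preference loss for each (winner, loser) pair.
    return [-math.log(sigmoid(score(theta, w) - score(theta, l)))
            for w, l in pairs]

def follower_weights(losses, temperature=1.0):
    # Stand-in follower (assumption): exponential tilting puts more
    # weight on high-loss pairs. Not the Wasserstein follower itself.
    m = max(losses)
    exps = [math.exp((v - m) / temperature) for v in losses]
    total = sum(exps)
    return [e / total for e in exps]

def leader_step(theta, pairs, weights, lr=0.05):
    # One gradient step on the reweighted logistic loss.
    grad = 0.0
    for (w, l), wt in zip(pairs, weights):
        gap = len(w) - len(l)
        grad += wt * (-(1.0 - sigmoid(theta * gap)) * gap)
    return theta - lr * grad

def run(seed_pairs, prompt_pool, theta=0.1, iterations=3):
    pairs = list(seed_pairs)
    for _ in range(iterations):
        # Self-annotation: expand the preference set from the prompt pool.
        for candidates in prompt_pool:
            pairs.append(self_annotate(theta, candidates))
        # Follower picks adversarial weights, then the leader updates.
        weights = follower_weights(pair_losses(theta, pairs))
        theta = leader_step(theta, pairs, weights)
    return theta, pairs

theta, pairs = run([("a long answer", "short")],
                   [["hi", "hello there"], ["x", "xyz"]])
print(theta, len(pairs))
```

The sketch keeps the three roles from the caption separate (self-annotation, follower, leader) so that the stand-in follower could be swapped for a proper distributionally robust solver without touching the rest of the loop.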

Theorems & Definitions (45)

  • Definition 2.1: Stackelberg equilibrium
  • Theorem 2.2: Existence of a Stackelberg equilibrium
  • Theorem 2.3: Well-posedness and local linear convergence
  • Theorem 2.4: Worst-case performance in gap space
  • Theorem 2.5: SGPO regret bound
  • Theorem 2.6: DPO regret lower bound
  • Corollary 2.7: SGPO advantage over DPO
  • Lemma 3.1: Closed‑form follower in the unconstrained one‑dimensional case; Esfahani2018Data, Thm. 6.3
  • Proposition 3.2: Monotone tightening in $K$
  • Theorem 3.3: Finite convex program for max‑of‑affine (PWL convex) losses; specialization of Esfahani2018Data, Thm. 4.4
  • ...and 35 more