Stackelberg Self-Annotation: A Robust Approach to Data-Efficient LLM Alignment
Xu Chu, Zhixin Zhang, Tianyu Jia, Yujie Jin
TL;DR
This work tackles the data-efficiency challenge in aligning LLMs with human preferences by modeling preference optimization as a Stackelberg game (SGPO) between a policy leader and a worst-case follower that chooses a preference distribution within an $\epsilon$-Wasserstein ball. It proves the existence of a Stackelberg equilibrium, establishes $\mathcal{O}(\epsilon)$-bounded regret under distributional shifts, and contrasts this robustness with standard DPO. The authors instantiate SGPO as SSAPO, which bootstraps from a small seed of human labels, self-annotates new prompts, and applies a distributionally robust reweighting to mitigate annotation noise, achieving strong performance with only about 1/30 of the typical number of human annotations. Empirically, SSAPO delivers competitive or superior results on AlpacaEval 2.0 and MT-Bench across multiple models and remains robust to noise in the seed labels, highlighting a practical path to data-efficient LLM alignment at scale.
Abstract
Aligning large language models (LLMs) with human preferences typically demands vast amounts of meticulously curated data, which is both expensive and prone to labeling noise. We propose Stackelberg Game Preference Optimization (SGPO), a robust alignment framework that models alignment as a two-player Stackelberg game between a policy (leader) and a worst-case preference distribution (follower). The proposed SGPO guarantees $\mathcal{O}(\epsilon)$-bounded regret within an $\epsilon$-Wasserstein ball, offering formal robustness to (self-)annotation noise. We instantiate SGPO with Stackelberg Self-Annotated Preference Optimization (SSAPO), which uses minimal human-labeled "seed" preferences and iteratively self-annotates new prompts. In each iteration, SSAPO applies a distributionally robust reweighting of synthetic annotations, ensuring that noisy or biased self-labels do not derail training. Remarkably, using only 2K seed preferences (about 1/30 of standard human labels), SSAPO achieves strong win rates against GPT-4 across multiple benchmarks within three iterations. These results highlight that a principled Stackelberg formulation yields data-efficient alignment for LLMs, significantly reducing reliance on costly human annotations.
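
The leader-follower structure described above can be sketched as a max-min problem. The notation below is an illustrative assumption and is not taken from the paper: $\pi$ denotes the policy, $\hat{P}$ the empirical preference distribution built from seed and self-annotated labels, $W$ a Wasserstein distance, and $J$ a generic preference-optimization objective.

$$\max_{\pi} \; \min_{P \,:\, W(P, \hat{P}) \le \epsilon} \; J(\pi, P)$$

To see why a bound of order $\epsilon$ is plausible, consider the special case (again an assumption made only for illustration) where $W$ is the 1-Wasserstein distance and $J(\pi, P) = \mathbb{E}_{P}[f_{\pi}]$ for a function $f_{\pi}$ that is $L$-Lipschitz. Kantorovich-Rubinstein duality then gives, for every $P$ in the ball,

$$\left| J(\pi, P) - J(\pi, \hat{P}) \right| \le L \, W(P, \hat{P}) \le L \epsilon,$$

so the objective under any admissible shift differs from the empirical one by at most $L\epsilon$. The paper's actual objective and proof may differ from this simplified case.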
