A Statistical Framework for Alignment with Biased AI Feedback
Xintao Xia, Zhiqiu Xia, Linjun Zhang, Zhanrui Cai
TL;DR
This paper tackles the problem of aligning large language models when AI-generated preferences are biased relative to human judgments. It introduces two debiasing methods, Debiased Direct Preference Optimization (DDPO) and Debiased Identity Preference Optimization (DIPO), which integrate AI and human feedback within a unified statistical framework, employing density-ratio weighting and influence-function-based bias correction. The authors prove suboptimality and efficiency guarantees for DDPO and DIPO, including semiparametric efficiency for DIPO, and provide regret bounds under realistic overlap and nuisance-estimation assumptions. Empirically, DDPO and DIPO improve alignment performance across sentiment generation, summarization, and dialogue tasks, achieving results close to an oracle trained only on human data and demonstrating practical value for scalable, robust alignment pipelines.
Abstract
Modern alignment pipelines are increasingly replacing expensive human preference labels with evaluations from large language models (LLM-as-Judge). However, AI labels can be systematically biased relative to high-quality human feedback. In this paper, we develop two debiased alignment methods within a general framework that accommodates heterogeneous prompt-response distributions and external human feedback sources. Debiased Direct Preference Optimization (DDPO) augments standard DPO with a residual-based correction and density-ratio reweighting to mitigate systematic bias, while retaining DPO's computational efficiency. Debiased Identity Preference Optimization (DIPO) directly estimates human preference probabilities without imposing a parametric reward model. We provide theoretical guarantees for both methods: DDPO offers a practical and computationally efficient solution for large-scale alignment, whereas DIPO serves as a robust, statistically optimal alternative that attains the semiparametric efficiency bound. Empirical studies on sentiment generation, summarization, and single-turn dialogue demonstrate that the proposed methods substantially improve alignment efficiency and recover performance close to that of an oracle trained on fully human-labeled data.
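The abstract does not state the DDPO objective, so the following is only an illustrative sketch of the general idea it names: a DPO-style loss computed with AI labels, plus a residual correction computed on a smaller human-labeled set and reweighted by density ratios. The function names, the tuple layout of the batches, and the assumption that the density-ratio weights are already estimated are all hypothetical and are not taken from the paper.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(margin, label):
    """DPO-style logistic loss for one preference pair.

    margin: scaled log-ratio difference between the two responses
            (assumed to be computed elsewhere from policy and reference).
    label:  probability (or 0/1 indicator) that the first response wins.
    """
    return -(label * log_sigmoid(margin)
             + (1.0 - label) * log_sigmoid(-margin))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def debiased_loss(ai_batch, human_batch):
    """Plug-in loss on AI labels plus a weighted residual correction.

    ai_batch:    iterable of (margin, ai_label)
    human_batch: iterable of (margin, ai_label, human_label, weight),
                 where weight is a hypothetical precomputed density
                 ratio between the target and human-data distributions.
    """
    plug_in = mean(dpo_loss(m, y_ai) for m, y_ai in ai_batch)
    correction = mean(
        w * (dpo_loss(m, y_h) - dpo_loss(m, y_ai))
        for m, y_ai, y_h, w in human_batch
    )
    return plug_in + correction


# Toy usage: the AI judge disagrees with the human on the second pair.
ai_batch = [(1.2, 1.0), (-0.4, 1.0), (0.3, 0.0)]
human_batch = [
    (1.2, 1.0, 1.0, 1.0),
    (-0.4, 1.0, 0.0, 1.0),
    (0.3, 0.0, 0.0, 1.0),
]
loss = debiased_loss(ai_batch, human_batch)
```

In this toy setup the two batches cover the same pairs with unit weights, so the correction cancels the AI-label loss exactly and the result equals the loss under human labels alone. With distinct batches and estimated weights the cancellation holds only in expectation, which is the situation the paper's guarantees concern.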
