A Statistical Framework for Alignment with Biased AI Feedback

Xintao Xia, Zhiqiu Xia, Linjun Zhang, Zhanrui Cai

TL;DR

This paper tackles the problem of aligning large language models when AI-generated preferences are biased relative to human judgments. It introduces two debiasing methods, Debiased Direct Preference Optimization (DDPO) and Debiased Identity Preference Optimization (DIPO), which integrate AI and human feedback within a unified statistical framework, employing density-ratio weighting and influence-function-based bias correction. The authors prove suboptimality and efficiency guarantees for DDPO and DIPO, including semiparametric efficiency for DIPO, and provide regret bounds under realistic overlap and nuisance-estimation assumptions. Empirically, DDPO and DIPO improve alignment performance across sentiment generation, summarization, and dialogue tasks, achieving results close to an oracle trained only on human data and demonstrating practical value for scalable, robust alignment pipelines.
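To make "density-ratio weighting" concrete, here is a minimal sketch, not the paper's estimator. It assumes prompt-response pairs have already been mapped to a small number of discrete bins (a hypothetical preprocessing step) and estimates the ratio $w(b) = p_{\text{target}}(b) / p_{\text{source}}(b)$ from smoothed frequencies. The function name, the binning, and the smoothing constant are all illustrative assumptions.

```python
import numpy as np

def density_ratio_by_counts(target_bins, source_bins, n_bins, smoothing=1.0):
    """Estimate w(b) = p_target(b) / p_source(b) for discrete bins.

    target_bins, source_bins: integer bin ids in [0, n_bins) for samples
        drawn from the target and source distributions (hypothetical inputs).
    smoothing: additive pseudo-count that keeps the ratio finite when a
        bin is empty in the source sample.
    """
    t = np.bincount(target_bins, minlength=n_bins) + smoothing
    s = np.bincount(source_bins, minlength=n_bins) + smoothing
    return (t / t.sum()) / (s / s.sum())

# Toy data: the source over-represents bin 1 relative to the target.
target = np.array([0, 0, 1, 1])
source = np.array([0, 1, 1, 1])
w = density_ratio_by_counts(target, source, n_bins=2, smoothing=0.0)
per_sample_weights = w[source]  # weight attached to each source sample
```

With continuous features, a common substitute is to fit a probabilistic classifier that separates the two samples and convert its odds into a ratio. Either way, the weights are only stable when the two distributions overlap, which is the role of the overlap assumption mentioned above.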

Abstract

Modern alignment pipelines are increasingly replacing expensive human preference labels with evaluations from large language models (LLM-as-Judge). However, AI labels can be systematically biased compared to high-quality human feedback datasets. In this paper, we develop two debiased alignment methods within a general framework that accommodates heterogeneous prompt-response distributions and external human feedback sources. Debiased Direct Preference Optimization (DDPO) augments standard DPO with a residual-based correction and density-ratio reweighting to mitigate systematic bias, while retaining DPO's computational efficiency. Debiased Identity Preference Optimization (DIPO) directly estimates human preference probabilities without imposing a parametric reward model. We provide theoretical guarantees for both methods: DDPO offers a practical and computationally efficient solution for large-scale alignment, whereas DIPO serves as a robust, statistically optimal alternative that attains the semiparametric efficiency bound. Empirical studies on sentiment generation, summarization, and single-turn dialogue demonstrate that the proposed methods substantially improve alignment efficiency and recover performance close to that of an oracle trained on fully human-labeled data.
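The idea of adding a residual-based correction with density-ratio reweighting to DPO can be sketched as follows. This is an illustrative construction in the spirit of the abstract, not the paper's exact DDPO objective: a plug-in DPO loss on AI-labeled pairs is corrected by the weighted gap between the human-label loss and the AI-label loss on the pairs that carry both labels. All function and argument names are hypothetical, and the margins are assumed to be precomputed as $\beta$ times the difference of policy-versus-reference log-ratios for the two responses.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -np.logaddexp(0.0, -x)

def dpo_loss(margin, label):
    """Per-example DPO-style logistic loss.

    margin: beta * (log-ratio of response 1 minus log-ratio of response 2),
        where each log-ratio is log pi(y|x) - log pi_ref(y|x).
    label: 1.0 if response 1 is preferred, 0.0 otherwise (soft labels allowed).
    """
    return -(label * log_sigmoid(margin) + (1.0 - label) * log_sigmoid(-margin))

def debiased_dpo_loss(margin_ai, y_ai, margin_h, y_ai_on_h, y_h, weights):
    """Plug-in loss on AI labels plus a weighted residual correction.

    margin_ai, y_ai: margins and AI labels on the large AI-labeled set.
    margin_h: margins on the small human-labeled set.
    y_ai_on_h, y_h: AI and human labels for those same pairs.
    weights: density-ratio weights for the human-labeled pairs
        (assumed to be estimated separately).
    """
    plug_in = dpo_loss(margin_ai, y_ai).mean()
    residual = dpo_loss(margin_h, y_h) - dpo_loss(margin_h, y_ai_on_h)
    return plug_in + (weights * residual).mean()

# Toy check: if AI labels agree with human labels, the correction vanishes.
m = np.array([0.5, -1.0, 2.0])
y = np.array([1.0, 0.0, 1.0])
loss_no_bias = debiased_dpo_loss(m, y, m, y, y, np.ones(3))
```

When the AI judge agrees with humans, the residual term is zero and the objective reduces to standard DPO on AI labels. When it disagrees, the correction shifts the objective toward the human-label loss. In practice the margins would come from a differentiable model rather than fixed arrays.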

Paper Structure

This paper contains 17 sections, 5 theorems, 34 equations, 8 tables, and 2 algorithms.

Key Result

Theorem 1

Under Assumptions assump:realizability through assump:aif_hf, the suboptimality gap of the DDPO estimator satisfies [...], where $\pi^{*}$ denotes the oracle optimal policy defined in eq:pi*_DPO and [...]

Theorems & Definitions (5)

  • Theorem 1: Suboptimality gap for DDPO
  • Lemma 4.1
  • Theorem 2: Asymptotic expansion of DIPO
  • Corollary 1: Efficiency gain over human-only IPO
  • Theorem 3: Regret bound for DIPO