A Unified Theoretical Analysis of Private and Robust Offline Alignment: from RLHF to DPO

Xingyu Zhou, Yulian Wu, Francesco Orabona

TL;DR

This work addresses the impact of noisy labels arising from corruption and privacy in offline alignment for RLHF and DPO by developing a unified framework that covers three privacy-corruption orders (CTL, LTC, CLC) under linear modeling. It introduces a reduction to logistic regression and demonstrates a separation: local privacy before corruption (LTC) is harder than corruption before privacy (CTL). The authors propose a private-robust estimation algorithm based on randomized response and a novel unbiased loss, yielding state-of-the-art guarantees in privacy-only or corruption-only regimes and establishing O(1/√n) rates for DPO under corruption. The results offer principled guidance for designing private and robust offline alignment systems and are supported by experiments showing the CTL/LTC separation in practice.

Abstract

In this paper, we theoretically investigate the effects of noisy labels in offline alignment, with a focus on the interplay between privacy and robustness against adversarial corruption. Specifically, under linear modeling assumptions, we present a unified analysis covering both reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) under different privacy-corruption scenarios, such as Local differential privacy-then-Corruption (LTC), where human preference labels are privatized before being corrupted by an adversary, and Corruption-then-Local differential privacy (CTL), where labels are corrupted before privacy protection. Our analysis leverages a reduction framework that reduces the offline alignment problem under linear modeling assumptions to parameter estimation in logistic regression. This framework allows us to establish an interesting separation result between LTC and CTL, demonstrating that LTC presents a greater challenge than CTL in offline alignment, even under linear models. As important by-products, our findings also advance the state-of-the-art theoretical results in offline alignment under privacy-only or corruption-only scenarios.
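To make the two orderings concrete, the following is a minimal sketch, assuming binary labels in {0, 1}, the standard randomized-response mechanism (keep the label with probability $e^{\varepsilon}/(1+e^{\varepsilon})$), and a toy adversary that overwrites a fixed fraction of labels. The function names, the adversary, and the label-mean debiasing step are illustrative assumptions, not the paper's algorithm or its unbiased loss; the sketch only gives one intuition for why corruption applied after privatization can hurt more.

```python
import math
import random


def keep_probability(epsilon):
    """Probability that randomized response reports the true binary label."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomized_response(labels, epsilon, rng):
    """Flip each 0/1 label independently with probability 1 / (1 + e^epsilon)."""
    p = keep_probability(epsilon)
    return [y if rng.random() < p else 1 - y for y in labels]


def corrupt(labels, alpha):
    """Hypothetical adversary: overwrite the first alpha-fraction of labels with 1."""
    k = int(alpha * len(labels))
    return [1] * k + list(labels[k:])


def debias(private_labels, epsilon):
    """Rescale privatized labels so their expectation equals the pre-privacy label."""
    p = keep_probability(epsilon)
    return [(z - (1.0 - p)) / (2.0 * p - 1.0) for z in private_labels]


def ctl(labels, alpha, epsilon, rng):
    """Corruption-then-LDP: the adversary sees raw labels, then they are privatized."""
    return randomized_response(corrupt(labels, alpha), epsilon, rng)


def ltc(labels, alpha, epsilon, rng):
    """LDP-then-Corruption: labels are privatized first, then the adversary acts."""
    return corrupt(randomized_response(labels, epsilon, rng), alpha)


def mean(values):
    return sum(values) / len(values)


def label_mean_bias(order, n=20000, alpha=0.1, epsilon=0.5, seed=0):
    """Mean of the debiased labels when every clean label is 0 (ideal value: 0)."""
    rng = random.Random(seed)
    clean = [0] * n
    noisy = order(clean, alpha, epsilon, rng)
    return mean(debias(noisy, epsilon))


if __name__ == "__main__":
    print("CTL bias:", label_mean_bias(ctl))
    print("LTC bias:", label_mean_bias(ltc))
```

For this particular toy adversary, the expected bias is $\alpha$ under CTL and $\alpha\, p/(2p-1)$ under LTC, where $p$ is the keep probability: a label corrupted after privatization passes through the debiasing rescaling, whose factor $1/(2p-1)$ grows as $\varepsilon$ shrinks. This matches the direction of the separation described in the abstract, but it is not a substitute for the paper's formal bounds.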

Paper Structure

This paper contains 31 sections, 12 theorems, 81 equations, 5 tables, 2 algorithms.

Key Result

Proposition 4.2

Under Assumption ass:lin-reward, the labels $\{y_i\}_{i\in [n]}$ in the preference dataset of RLHF follow the logistic regression model with ${\theta}_{\mathrm{true}} = \theta^{\star}$ and $x_i = \phi(s_i,a_i^1)- \phi(s_i,a_i^0)$. Algorithm alg:RLHF with $\eta = 0$ achieves where $\pi^{\star} = \mathop{\mathrm{argmax}}_{\pi} J(\pi)$. Further, let $\widehat{\Sigma}:=\frac{1}{n}\sum_i x_i x_i^{\top}$ ...
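The reduction stated in the proposition can be illustrated with a small simulation. This is a minimal sketch, assuming clean labels, synthetic uniform feature vectors standing in for $\phi$, and plain maximum-likelihood estimation by gradient ascent; it is not the paper's Algorithm alg:RLHF, and every name in it is hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def simulate_preferences(theta_star, n, rng):
    """Draw (x_i, y_i) with x_i = phi(s_i, a_i^1) - phi(s_i, a_i^0) and
    P(y_i = 1 | x_i) = sigmoid(<theta_star, x_i>)."""
    d = len(theta_star)
    data = []
    for _ in range(n):
        # Synthetic stand-ins for the feature map phi (an assumption of this sketch).
        phi1 = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        phi0 = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        x = [a - b for a, b in zip(phi1, phi0)]
        y = 1 if rng.random() < sigmoid(dot(theta_star, x)) else 0
        data.append((x, y))
    return data


def fit_logistic(data, steps=200, lr=1.0):
    """Maximize the average logistic log-likelihood by gradient ascent."""
    d = len(data[0][0])
    n = len(data)
    theta = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for x, y in data:
            residual = y - sigmoid(dot(theta, x))
            for j in range(d):
                grad[j] += residual * x[j]
        theta = [t + lr * g / n for t, g in zip(theta, grad)]
    return theta


if __name__ == "__main__":
    rng = random.Random(0)
    theta_star = [1.0, -0.5]
    data = simulate_preferences(theta_star, 2000, rng)
    print("estimate:", fit_logistic(data))
```

The point of the sketch is only that, once each comparison is encoded as the feature difference $x_i$, estimating the reward parameter is an ordinary logistic regression problem; handling privatized or corrupted labels requires the modified losses analyzed in the paper.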

Theorems & Definitions (38)

  • Definition 3.1: Label DP in Local Model chowdhury2023differentially
  • Remark 3.2: Randomized Response
  • Definition 3.3: Label Corruption Model
  • Definition 3.4: CTL and LTC
  • Remark 3.5
  • Proposition 4.2
  • Definition 4.3: Relative Condition Number
  • Corollary 4.4
  • Proposition 4.6
  • Remark 4.7
  • ...and 28 more