A Unified Theoretical Analysis of Private and Robust Offline Alignment: from RLHF to DPO
Xingyu Zhou, Yulian Wu, Francesco Orabona
TL;DR
This work studies how noisy labels, arising from adversarial corruption and from privacy protection, affect offline alignment for both RLHF and DPO. Under linear modeling assumptions, it develops a unified framework covering three privacy-corruption orders (CTL, LTC, CLC), where CTL denotes corruption before local privacy and LTC denotes local privacy before corruption. It reduces offline alignment to parameter estimation in logistic regression and demonstrates a separation: LTC is harder than CTL. The authors propose a private and robust estimation algorithm based on randomized response and a novel unbiased loss, yielding state-of-the-art guarantees in privacy-only or corruption-only regimes and establishing O(1/√n) rates for DPO under corruption. The results offer principled guidance for designing private and robust offline alignment systems and are supported by experiments showing the CTL/LTC separation in practice.
Abstract
In this paper, we theoretically investigate the effects of noisy labels in offline alignment, with a focus on the interplay between privacy and robustness against adversarial corruption. Specifically, under linear modeling assumptions, we present a unified analysis covering both reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) under different privacy-corruption scenarios, such as Local differential privacy-then-Corruption (LTC), where human preference labels are privatized before being corrupted by an adversary, and Corruption-then-Local differential privacy (CTL), where labels are corrupted before privacy protection. Our analysis leverages a reduction framework that reduces the offline alignment problem under linear modeling assumptions to parameter estimation in logistic regression. This framework allows us to establish an interesting separation result between LTC and CTL, demonstrating that LTC presents a greater challenge than CTL in offline alignment, even under linear models. As important by-products, our findings also advance the state-of-the-art theoretical results in offline alignment under privacy-only or corruption-only scenarios.
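To give some intuition for why the order of privatization and corruption can matter, the following toy sketch compares LTC and CTL on a much simpler problem than the one analyzed in the paper: estimating the mean of binary labels. It is not the authors' algorithm or loss. It assumes randomized response as the local privacy mechanism, a hypothetical adversary that overwrites a fraction `alpha` of the labels with 1, and a standard debiasing step that inverts the randomized-response flip probability. All function names and parameter values are illustrative.

```python
import math
import random


def keep_probability(epsilon):
    """Probability that randomized response reports the true binary label."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomized_response(label, epsilon, rng):
    """Report the label truthfully with prob e^eps / (1 + e^eps), else flip it."""
    return label if rng.random() < keep_probability(epsilon) else 1 - label


def debias(observed_mean, epsilon):
    """Invert the bias that randomized response adds to a mean of 0/1 labels."""
    keep = keep_probability(epsilon)
    return (observed_mean - (1.0 - keep)) / (2.0 * keep - 1.0)


def corrupt(labels, alpha):
    """Hypothetical adversary: overwrite the first alpha fraction of labels with 1."""
    k = int(alpha * len(labels))
    return [1] * k + list(labels[k:])


def estimate_mean(true_labels, epsilon, alpha, order, rng):
    """Debiased mean estimate under the CTL or LTC ordering."""
    if order == "CTL":  # corruption first, then local privacy
        corrupted = corrupt(true_labels, alpha)
        observed = [randomized_response(y, epsilon, rng) for y in corrupted]
    elif order == "LTC":  # local privacy first, then corruption
        private = [randomized_response(y, epsilon, rng) for y in true_labels]
        observed = corrupt(private, alpha)
    else:
        raise ValueError("order must be 'CTL' or 'LTC'")
    return debias(sum(observed) / len(observed), epsilon)


def compare_orders(n=100_000, epsilon=1.0, alpha=0.1, seed=0):
    """Return the absolute estimation error under each ordering."""
    rng = random.Random(seed)
    true_labels = [i % 2 for i in range(n)]  # true mean is exactly 0.5
    true_mean = sum(true_labels) / n
    ctl_error = abs(estimate_mean(true_labels, epsilon, alpha, "CTL", rng) - true_mean)
    ltc_error = abs(estimate_mean(true_labels, epsilon, alpha, "LTC", rng) - true_mean)
    return ctl_error, ltc_error


if __name__ == "__main__":
    ctl_error, ltc_error = compare_orders()
    print(f"CTL error: {ctl_error:.3f}, LTC error: {ltc_error:.3f}")
```

In this toy setting, corruption injected after privatization passes through the debiasing step and is scaled by 1 / (2 * keep - 1), which equals (e^ε + 1) / (e^ε - 1), whereas corruption injected before privatization is not amplified in expectation. This is only meant to illustrate the flavor of the LTC/CTL separation; the paper's result concerns parameter estimation in logistic regression.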
