Policy Teaching via Data Poisoning in Learning from Human Preferences

Andi Nika, Jonathan Nöther, Debmalya Mandal, Parameswaran Kamalaruban, Adish Singla, Goran Radanović

TL;DR

Policy Teaching via Data Poisoning investigates how an adversary can enforce a target policy by poisoning human-preference data in RLHF and DPO. The authors formalize a general poisoning framework and derive both lower and upper bounds on the attack sample complexity under data augmentation and data synthesis scenarios, for unregularized and regularized RLHF as well as for DPO. A key finding is that DPO tends to remain closer to the reference policy when the target is distant, suggesting it may be more robust to poisoning in certain regimes, while RLHF can be more susceptible depending on the data geometry and regularization. Together, these results provide a theoretical baseline for the robustness of two major preference-based learning paradigms and highlight design considerations for defense and policy teaching in practice.

Abstract

We study data poisoning attacks in learning from human preferences. More specifically, we consider the problem of teaching/enforcing a target policy $\pi^\dagger$ by synthesizing preference data. We seek to understand the susceptibility of different preference-based learning paradigms to poisoned preference data by analyzing the number of samples required by the attacker to enforce $\pi^\dagger$. We first propose a general data poisoning formulation in learning from human preferences and then study it for two popular paradigms, namely: (a) reinforcement learning from human feedback (RLHF) that operates by learning a reward model using preferences; (b) direct preference optimization (DPO) that directly optimizes policy using preferences. We conduct a theoretical analysis of the effectiveness of data poisoning in a setting where the attacker is allowed to augment a pre-existing dataset and also study its special case where the attacker can synthesize the entire preference dataset from scratch. As our main results, we provide lower/upper bounds on the number of samples required to enforce $\pi^\dagger$. Finally, we discuss the implications of our results in terms of the susceptibility of these learning paradigms under such data poisoning attacks.

Paper Structure

This paper contains 35 sections, 32 theorems, 279 equations, 2 figures.

Key Result

Theorem 4.1

Let $\overline{D}$ be a given preference dataset of $\overline{n}$ samples, let $\beta =0$, $\epsilon'>0$ and $\pi^\dagger\in\Pi^\textnormal{det}$. Furthermore, let $\overline{\omega}$ be optimal for $\ell^\omega_\textnormal{RLHF}(\overline{D})$, define $\omega^\dagger$ as and let $\gamma\geq 1-2\left\lVert\omega^\dagger\right\rVert/(\xi_{\max}+1)$. Then, the dataset of $\left\lceil \left\vert(\o

Figures (2)

  • Figure 1: A geometric illustration of our attack model for RLHF. The shaded regions represent the reward parameter spaces where optimal policies are $\epsilon$-close to $\pi^\dagger$. The blue arrows represent attack samples, while the yellow arrows represent the pre-existing data samples from $\overline{D}$. Finally, the red shape represents the optimal reward parameters with respect to the generated dataset $\widehat{D}$. Each added attack sample moves the optimal parameter closer to the shaded region. For unregularized RLHF with empty $\overline{D}$ (left), the attack problem is solved in the reward parameter space, and the target space is a polytope. For unregularized RLHF with non-empty $\overline{D}$ (middle), the required samples depend on the alignment of $\pi^\dagger$ with $\overline{D}$. For regularized RLHF (right), since the optimal policy is not necessarily deterministic, the geometry of the target space becomes non-linear.
  • Figure 2: A geometric illustration of our attack model for DPO. Here, the distinction between empty and non-empty $\overline{D}$ is similar to Figure 1. In contrast to the RLHF setting, here the attacker operates directly in the policy parameter space and the target feasible region is a ball centered around $\theta^\dagger$ with radius $\epsilon$ as outlined in the formulation of Problem \ref{op:dpo-aug}.
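
The geometric picture in Figure 1, where each added attack sample pulls the fitted reward parameter toward the target region, can be illustrated with a small numerical sketch. This is not the paper's attack construction. It assumes two-dimensional features, a Bradley-Terry likelihood with linear rewards, and a small ridge penalty that is added here only so that the minimizer stays finite. The names `fit_reward`, `alignment_after_attack`, `existing`, and `target_direction` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_reward(diffs, ridge=0.1, lr=0.5, steps=2000):
    """Gradient descent on the averaged Bradley-Terry negative log-likelihood.

    Each element of diffs is phi(preferred) - phi(rejected).
    The ridge term is an assumption of this sketch, not part of the paper.
    """
    dim = len(diffs[0])
    n = len(diffs)
    omega = [0.0] * dim
    for _ in range(steps):
        grad = [ridge * w for w in omega]
        for x in diffs:
            margin = sum(w * xi for w, xi in zip(omega, x))
            coeff = -(1.0 - sigmoid(margin)) / n
            for j in range(dim):
                grad[j] += coeff * x[j]
        omega = [w - lr * g for w, g in zip(omega, grad)]
    return omega

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical pre-existing data: five comparisons favouring the first feature.
existing = [[1.0, 0.0]] * 5
# Hypothetical attack sample: a comparison favouring the target direction.
attack_sample = [0.0, 1.0]
target_direction = [0.0, 1.0]

def alignment_after_attack(num_attack_samples):
    """Cosine between the fitted reward parameter and the target direction."""
    data = existing + [attack_sample] * num_attack_samples
    return cosine(fit_reward(data), target_direction)

for k in (0, 5, 20):
    print(k, round(alignment_after_attack(k), 3))
```

In this toy setup the alignment with the target direction grows as more attack samples are appended, which mirrors the qualitative behaviour described in the caption. The number of samples needed in the paper's setting is governed by its theorems, not by this sketch.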

Theorems & Definitions (60)

  • Definition 2.1: Linear rewards
  • Definition 2.2: Loglinear policies
  • Definition 2.3: Bradley-Terry preference model (Bradley and Terry, 1952)
  • Theorem 4.1
  • Corollary 4.1
  • Theorem 4.2
  • Remark 4.1
  • Corollary 4.2
  • Corollary 4.3
  • Remark 4.2
  • ...and 50 more
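
The first three definitions in this list name standard objects, and a minimal sketch of their usual textbook forms may help. The exact statements of Definitions 2.1 to 2.3 are not reproduced in this excerpt, so the function names, the feature vectors, and the parameterization below are assumptions: a reward that is linear in a feature vector, a policy given by a softmax over linear scores, and a preference probability given by the logistic function of a reward difference.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def linear_reward(phi, omega):
    """Reward as the inner product of a feature vector and a parameter vector."""
    return dot(phi, omega)

def loglinear_policy(action_features, theta):
    """Softmax over per-action scores <feature, theta> for a single state."""
    scores = [dot(f, theta) for f in action_features]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bradley_terry_prob(reward_first, reward_second):
    """Probability that the first item is preferred over the second."""
    return 1.0 / (1.0 + math.exp(-(reward_first - reward_second)))

# Hypothetical features for two actions in one state.
features = [[1.0, 0.0], [0.0, 1.0]]
omega = [2.0, 0.5]
rewards = [linear_reward(f, omega) for f in features]
print(loglinear_policy(features, omega))
print(bradley_terry_prob(rewards[0], rewards[1]))
```

Under these forms, a preference dataset influences the learner only through feature differences between the preferred and the rejected action, which is why the figures above depict samples as arrows in parameter space.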