Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment

Shigeki Kusaka, Keita Saito, Mikoto Kudo, Takumi Tanabe, Akifumi Wachi, Youhei Akimoto

TL;DR

The paper investigates the fundamental cost of poisoning LLM alignment via label-flipping during RLHF/DPO, reframing the attack as steering the reward model to a target function with minimal label flips. It derives convex/linear programming bounds for the minimal flip cost under fixed and adaptive embeddings and introduces a practical post-processing method (PCM) to minimize flips for any existing attack while preserving poisoning effects. Theoretical results show that attacks can be significantly cheaper when the reward-feature dimension is small relative to data and that adaptive embedding scenarios admit worst-case low-cost attacks under suitable representational capacity. Empirical evaluations on synthetic data and public LLM datasets demonstrate substantial cost reductions with PCM, especially as dataset size grows, while also quantifying the trade-offs in performance loss and discretization. Overall, the work highlights a key vulnerability in RLHF/DPO pipelines and provides a concrete framework and tools for stress-testing and robustness assessment against low-cost data-poisoning threats.

Abstract

Large language models (LLMs) are increasingly deployed in real-world systems, making it critical to understand their vulnerabilities. While data poisoning attacks during RLHF/DPO alignment have been studied empirically, their theoretical foundations remain unclear. We investigate the minimum-cost poisoning attack required to steer an LLM's policy toward an attacker's target by flipping preference labels during RLHF/DPO, without altering the compared outputs. We formulate this as a convex optimization problem with linear constraints, deriving lower and upper bounds on the minimum attack cost. As a byproduct of this theoretical analysis, we show that any existing label-flipping attack can be post-processed via our proposed method to reduce the number of label flips required while preserving the intended poisoning effect. Empirical results demonstrate that this cost-minimization post-processing can significantly reduce poisoning costs over baselines, particularly when the reward model's feature dimension is small relative to the dataset size. These findings highlight fundamental vulnerabilities in RLHF/DPO pipelines and provide tools to evaluate their robustness against low-cost poisoning attacks.
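To make the "convex optimization problem with linear constraints" concrete, here is a minimal sketch of a deliberately simplified special case. It is not the paper's program or its PCM method. The assumptions are ours: a Bradley-Terry reward model with a single scalar feature, soft labels $p_i \in [0,1]$, and cost measured as $\sum_i |p'_i - p_i|$. Under these assumptions the target parameter is the maximum-likelihood estimate exactly when $\sum_i (p'_i - \sigma(\theta_A x_i))\, x_i = 0$, which is linear in the labels. With one feature, the resulting linear program reduces to a fractional-knapsack greedy. The function name `min_cost_soft_flips` is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def min_cost_soft_flips(x, p, theta_target):
    """Cheapest soft relabeling that makes theta_target the Bradley-Terry MLE.

    Assumptions (ours, for illustration only):
      x[i] is the scalar feature difference of preference pair i,
      p[i] in [0, 1] is the current label (1.0 = first output preferred),
      cost is the L1 distance between the new and the old labels.
    """
    # Stationarity of the log-likelihood at theta_target is linear in labels:
    #   sum_i new_p[i] * x[i] == sum_i sigmoid(theta_target * x[i]) * x[i]
    target_moment = sum(sigmoid(theta_target * xi) * xi for xi in x)
    current_moment = sum(pi * xi for pi, xi in zip(p, x))
    gap = target_moment - current_moment

    new_p = list(p)
    # Pairs with large |x[i]| close the gap fastest per unit of label change,
    # so visit them first (fractional-knapsack greedy).
    for i in sorted(range(len(x)), key=lambda j: -abs(x[j])):
        if abs(gap) < 1e-12:
            break
        if x[i] == 0.0:
            continue  # this pair cannot influence the constraint
        direction = 1.0 if gap * x[i] > 0 else -1.0
        room = (1.0 - new_p[i]) if direction > 0 else new_p[i]
        step = min(room, abs(gap) / abs(x[i]))
        new_p[i] += direction * step
        gap -= direction * step * x[i]

    cost = sum(abs(a - b) for a, b in zip(new_p, p))
    return new_p, cost


# Toy data: two pairs, both labeled 1.0; the attacker wants theta = 0.
x = [2.0, 1.0]
p = [1.0, 1.0]
new_p, cost = min_cost_soft_flips(x, p, theta_target=0.0)

# Naive alternative: overwrite every label with the target model's probability.
naive_cost = sum(abs(sigmoid(0.0 * xi) - pi) for xi, pi in zip(x, p))
print(new_p, cost, naive_cost)
```

In this toy instance only the pair with the largest feature difference is changed, and the total label change is smaller than relabeling every pair to the target model's probabilities. With a higher-dimensional reward feature, the same stationarity condition becomes a system of linear equalities and a general LP solver would be needed instead of the greedy.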

Paper Structure

This paper contains 45 sections, 11 theorems, 53 equations, 3 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

Suppose that Assumption \ref{asm} holds. Let $\zeta = \theta - \theta_O$. Then the minimum-cost poisoning attack problem \ref{eq:prob-gen} can be equivalently formulated as a convex optimization problem with linear equality and inequality constraints, in which $\leqslant$ denotes element-wise comparison (a convention used hereafter).

Figures (3)

  • Figure 1: Cost (right) and performance loss rate \ref{eq:plr} (left) of the proposed cost minimization, PCM, for the random flip attack. Results of 5 trials (points) and their median (lines). Minimized: the cost of $\theta_A^*$ before discretization; Original: the cost of $\theta_A$; Lower bound: \ref{eq:lower}; $\lVert\Phi^\dagger\Phi (\theta_A - \theta_O)\rVert_1$: a term appearing in the upper bound \ref{eq:upper}. The other lines show the performance loss rate and the cost of the proposed attack with discretization at different granularities $m$. Missing data points in the performance loss rate indicate no performance loss because $\theta_A = \theta_A^*$ (and hence no cost reduction either).
  • Figure 2: Cost (right) and performance loss rate (left) of the proposed cost minimization, PCM, for the RLHFPoison attack. See the caption of Figure 1 for details.
  • Figure 3: Output length distribution. Top: social-reasoning-rlhf, Middle: pku-saferlhf, Bottom: hh-rlhf. Left: Phi-3.5-mini, Center: LLaMA-2-7b, Right: LLaMA-2-13b.

Theorems & Definitions (22)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Remark 1
  • Proposition 4
  • Proposition 5
  • Theorem 6
  • Lemma 7
  • Lemma 8
  • Lemma 9
  • ...and 12 more