Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment

Shigeki Kusaka; Keita Saito; Mikoto Kudo; Takumi Tanabe; Akifumi Wachi; Youhei Akimoto

TL;DR

The paper investigates the fundamental cost of poisoning LLM alignment via label-flipping during RLHF/DPO, reframing the attack as steering the reward model to a target function with minimal label flips. It derives convex/linear programming bounds for the minimal flip cost under fixed and adaptive embeddings and introduces a practical post-processing method (PCM) to minimize flips for any existing attack while preserving poisoning effects. Theoretical results show that attacks can be significantly cheaper when the reward-feature dimension is small relative to data and that adaptive embedding scenarios admit worst-case low-cost attacks under suitable representational capacity. Empirical evaluations on synthetic data and public LLM datasets demonstrate substantial cost reductions with PCM, especially as dataset size grows, while also quantifying the trade-offs in performance loss and discretization. Overall, the work highlights a key vulnerability in RLHF/DPO pipelines and provides a concrete framework and tools for stress-testing and robustness assessment against low-cost data-poisoning threats.

Abstract

Large language models (LLMs) are increasingly deployed in real-world systems, making it critical to understand their vulnerabilities. While data poisoning attacks during RLHF/DPO alignment have been studied empirically, their theoretical foundations remain unclear. We investigate the minimum-cost poisoning attack required to steer an LLM's policy toward an attacker's target by flipping preference labels during RLHF/DPO, without altering the compared outputs. We formulate this as a convex optimization problem with linear constraints, deriving lower and upper bounds on the minimum attack cost. As a byproduct of this theoretical analysis, we show that any existing label-flipping attack can be post-processed via our proposed method to reduce the number of label flips required while preserving the intended poisoning effect. Empirical results demonstrate that this cost-minimization post-processing can significantly reduce poisoning costs over baselines, particularly when the reward model's feature dimension is small relative to the dataset size. These findings highlight fundamental vulnerabilities in RLHF/DPO pipelines and provide tools to evaluate their robustness against low-cost poisoning attacks.
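The cost-minimization idea in the abstract can be illustrated with a small linear program. The sketch below is not the paper's formulation; it assumes a linear Bradley-Terry reward model over fixed feature differences, and it allows soft labels in [0, 1] as a relaxation of hard flips. Under those assumptions, a target parameter `theta_target` maximizes the likelihood exactly when `X.T @ (y_new - sigmoid(X @ theta_target)) = 0`, which is linear in the labels, so the smallest total label change is a linear program. The function name `min_flip_lp` and all variable names are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def min_flip_lp(X, y, theta_target):
    """Relaxed minimum-cost label change so that theta_target becomes an MLE.

    Assumptions (illustrative, not taken from the paper):
      X            : (n, d) feature differences between the two compared outputs
      y            : (n,) clean preference labels in {0, 1}
      theta_target : (d,) reward parameter the attacker wants the learner to fit
    Returns the relaxed labels y_new in [0, 1]^n and the cost sum(|y_new - y|).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Preference probabilities under the target reward (Bradley-Terry / logistic).
    p = 1.0 / (1.0 + np.exp(-X @ theta_target))

    # For binary y and y_new in [0, 1]: |y_new - y| = (1 - 2y) * y_new + y,
    # so the L1 cost is linear in y_new up to the constant sum(y).
    c = 1.0 - 2.0 * y

    # Stationarity of the log-likelihood at theta_target: X^T y_new = X^T p.
    res = linprog(c, A_eq=X.T, b_eq=X.T @ p, bounds=(0.0, 1.0), method="highs")
    if not res.success:
        raise RuntimeError(res.message)

    return res.x, res.fun + y.sum()
```

The program is always feasible because `y_new = p` satisfies the constraints, so the optimal cost is never worse than replacing every label with the target probability. Turning the relaxed labels into hard flips requires a discretization step, which this sketch does not attempt.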
