LLM Misalignment via Adversarial RLHF Platforms
Erfan Entezami, Ali Naseh
TL;DR
This paper investigates security risks in RLHF platforms by proposing an adversarial framework that can misalign LLMs during RLHF fine-tuning. It introduces a task-specific classifier $\Theta$ embedded in the platform to identify samples related to targeted topics and performs a label-flipping attack on the targeted preference data, poisoning the reward model $R$ and shifting the RLHF objective. The authors model RLHF as a PbRL problem with a reward model learned from pairwise preferences via a Bradley–Terry–Luce formulation $\Pr(o \succ o' \mid c) = \sigma(R(c,o) - R(c,o'))$ and a KL-regularized objective $R_{\phi}(c,o) = R(c,o) - \beta D_{KL}(\pi(o \mid c) \,\|\, \pi_{ref}(o \mid c))$, optimizing $\hat J(\pi)$. Experimental results on small GPT-2 variants show that manipulating as little as 25% of targeted data can shift reward distributions and induce misalignment in the full PPO-tuned pipeline, with attack effects varying by model size. The work highlights the need for security analyses and defenses for RLHF platforms to prevent platform-level data poisoning and backdoor effects.
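To make the two formulas in the summary concrete, here is a minimal Python sketch of the Bradley–Terry–Luce preference probability and the KL-regularized reward. It is an illustration under stated assumptions, not the authors' implementation: the function names (`preference_probability`, `pairwise_loss`, `regularized_reward`) are hypothetical, rewards are plain floats rather than outputs of a learned model, and the KL term is passed in as a precomputed number.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def preference_probability(reward_o: float, reward_o_prime: float) -> float:
    """Pr(o > o' | c) = sigmoid(R(c, o) - R(c, o')), as in the summary."""
    return sigmoid(reward_o - reward_o_prime)


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of one labelled pair, -log sigmoid(d),
    computed in a stable softplus form. This is the usual way to fit a
    reward model to pairwise labels; treating it as the training loss
    here is an assumption, not a detail confirmed by the summary."""
    d = reward_chosen - reward_rejected
    if d >= 0:
        return math.log1p(math.exp(-d))
    return -d + math.log1p(math.exp(d))


def regularized_reward(reward: float, kl_divergence: float, beta: float) -> float:
    """R_phi(c, o) = R(c, o) - beta * D_KL(pi || pi_ref).
    The KL value is assumed to be computed elsewhere."""
    return reward - beta * kl_divergence
```

Under this formulation, swapping the labels of a pair changes the per-pair loss from `pairwise_loss(r_a, r_b)` to `pairwise_loss(r_b, r_a)`, so training on a flipped pair pushes the reward model to rank the originally rejected output higher. This is the lever the label-flipping attack relies on.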
Abstract
Reinforcement learning has shown remarkable performance in aligning language models with human preferences, which has driven growing interest in developing RLHF platforms. These platforms enable users to fine-tune models without requiring any expertise in developing complex machine learning algorithms. While these platforms offer useful features such as reward modeling and RLHF fine-tuning, their security and reliability remain largely unexplored. Given the growing adoption of RLHF and open-source RLHF frameworks, we investigate the trustworthiness of these systems and their potential impact on the behavior of LLMs. In this paper, we present an attack targeting publicly available RLHF tools. In our proposed attack, an adversarial RLHF platform corrupts the LLM alignment process by selectively manipulating data samples in the preference dataset. In this scenario, when a user's task aligns with the attacker's objective, the platform manipulates the subset of the preference dataset that contains samples related to the attacker's target. This manipulation results in a corrupted reward model, which ultimately leads to the misalignment of the language model. Our results demonstrate that such an attack can effectively steer LLMs toward undesirable behaviors within the targeted domains. Our work highlights the critical need to explore the vulnerabilities of RLHF platforms and their potential to cause misalignment in LLMs during the RLHF fine-tuning process.
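The selective manipulation described in the abstract can be sketched as a label flip over the targeted subset of a preference dataset. The Python below is a toy illustration for analyzing the threat model, not the authors' code: the record layout (`prompt`, `chosen`, `rejected`), the keyword matcher standing in for the task-specific classifier, the placeholder topic string `topic_x`, and the function names are all assumptions introduced here.

```python
import random
from typing import Callable, Dict, List

PreferencePair = Dict[str, str]  # assumed keys: "prompt", "chosen", "rejected"


def keyword_classifier(sample: PreferencePair, keywords=("topic_x",)) -> bool:
    """Hypothetical stand-in for the targeted-topic classifier:
    flags a sample if any keyword appears in its text."""
    text = " ".join(sample.values()).lower()
    return any(keyword in text for keyword in keywords)


def flip_targeted_preferences(
    dataset: List[PreferencePair],
    is_targeted: Callable[[PreferencePair], bool],
    fraction: float,
    seed: int = 0,
) -> List[PreferencePair]:
    """Return a copy of the dataset in which a given fraction of the
    targeted pairs have their chosen/rejected labels swapped."""
    rng = random.Random(seed)
    targeted = [i for i, sample in enumerate(dataset) if is_targeted(sample)]
    to_flip = set(rng.sample(targeted, round(fraction * len(targeted))))
    poisoned = []
    for i, sample in enumerate(dataset):
        if i in to_flip:
            # Swap the preference label for this targeted pair.
            poisoned.append({
                "prompt": sample["prompt"],
                "chosen": sample["rejected"],
                "rejected": sample["chosen"],
            })
        else:
            poisoned.append(dict(sample))
    return poisoned


# Toy data: four targeted pairs and two unrelated ones.
toy_dataset = [
    {"prompt": f"Question {i} about topic_x", "chosen": "good", "rejected": "bad"}
    for i in range(4)
] + [
    {"prompt": f"Unrelated question {i}", "chosen": "good", "rejected": "bad"}
    for i in range(2)
]

poisoned_quarter = flip_targeted_preferences(toy_dataset, keyword_classifier, 0.25)
```

Because only samples matched by the classifier are touched, the rest of the dataset is unchanged, which is what makes this kind of platform-level manipulation hard to notice from aggregate statistics alone.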
