BadGPT: Exploring Security Vulnerabilities of ChatGPT via Backdoor Attacks to InstructGPT
Jiawen Shi, Yixin Liu, Pan Zhou, Lichao Sun
TL;DR
The paper demonstrates a novel backdoor vulnerability in reinforcement learning fine-tuning for language models (RLHF) by introducing BadGPT, which poisons the reward model and enables trigger-based control during RL training. Using GPT-2 as the PLM and DistillBert as the reward model with IMDB feedback, the authors show high clean accuracy alongside strong attack success when the trigger is present. This reveals a tangible security risk from unauthorized third-party RLHF components and prompts the development of defenses and secure deployment practices. The work lays groundwork for evaluating scale-up scenarios and more robust defenses against backdoor injections in RL-based NLP systems. The practical impact is a call to fortify RLHF pipelines against supply-chain and backdoor threats in real-world AI deployments.
Abstract
Recently, ChatGPT has gained significant attention in research due to its ability to interact with humans effectively. The core idea behind this model is reinforcement learning (RL) fine-tuning, a new paradigm that allows language models to align with human preferences, i.e., InstructGPT. In this study, we propose BadGPT, the first backdoor attack against RL fine-tuning in language models. By injecting a backdoor into the reward model, the language model can be compromised during the fine-tuning stage. Our initial experiments on movie reviews, i.e., IMDB, demonstrate that an attacker can manipulate the generated text through BadGPT.
