Large Language Models are Vulnerable to Bait-and-Switch Attacks for Generating Harmful Content

Federico Bianchi, James Zou

TL;DR

The paper identifies a vulnerability in LLM safety: safe prompts and outputs can be transformed post-hoc into harmful content via Bait-and-Switch attacks, undermining guardrails that only scrutinize verbatim generations. It presents proxies as easily discoverable stand-ins for target concepts and demonstrates an automated pipeline that yields medical misinformation, fake political content, and toxic material when proxies are swapped after generation. By testing on GPT-4 and Claude-2, the authors show this threat arises from instruction-following and proxy-based tailoring, raising the need for post-generation safeguards. They discuss partial mitigations such as watermarking and fact-checking, but emphasize that robust defense requires broader strategies and safety education to anticipate and counter post-hoc manipulation.

Abstract

The risks derived from large language models (LLMs) generating deceptive and damaging content have been the subject of considerable research, but even safe generations can lead to problematic downstream impacts. In our study, we shift the focus to how even safe text coming from LLMs can be easily turned into potentially dangerous content through Bait-and-Switch attacks. In such attacks, the user first prompts LLMs with safe questions and then employs a simple find-and-replace post-hoc technique to manipulate the outputs into harmful narratives. The alarming efficacy of this approach in generating toxic content highlights a significant challenge in developing reliable safety guardrails for LLMs. In particular, we stress that focusing on the safety of the verbatim LLM outputs is insufficient and that we also need to consider post-hoc transformations.

Paper Structure (13 sections, 1 figure)

Figures (1)

  • Figure 1: Faithful responses from large language models can be transformed into problematic content with simple find-and-replace techniques. The example on the left is generated with Claude-2, and the example on the right is generated with GPT-4.