Intent Laundering: AI Safety Datasets Are Not What They Seem

Shahriar Golchin, Marc Wetter

TL;DR

The paper interrogates the realism of AI safety datasets by showing they rely heavily on triggering cues that do not match real-world attacks. It introduces intent laundering, a two-part process of connotation neutralization and context transposition, to remove overt cues while preserving malicious intent, revealing a large rise in attack success rates ($ASR$) when cues are stripped. The authors extend intent laundering into a jailbreaking loop that achieves $ASR$ between $90\%$ and $98.5\%$ across several models, indicating a substantial gap between safety evaluations and actual adversarial behavior. These findings call for safer, more realistic evaluation paradigms and improved alignment strategies that resist cue-based jailbreaks, with practical implications for model security and deployment.

Abstract

We systematically evaluate the quality of widely used AI safety datasets from two perspectives: in isolation and in practice. In isolation, we examine how well these datasets reflect real-world attacks based on three key properties: driven by ulterior intent, well-crafted, and out-of-distribution. We find that these datasets overrely on "triggering cues": words or phrases with overt negative/sensitive connotations that are intended to trigger safety mechanisms explicitly, which is unrealistic compared to real-world attacks. In practice, we evaluate whether these datasets genuinely measure safety risks or merely provoke refusals through triggering cues. To explore this, we introduce "intent laundering": a procedure that abstracts away triggering cues from attacks (data points) while strictly preserving their malicious intent and all relevant details. Our results indicate that current AI safety datasets fail to faithfully represent real-world attacks due to their overreliance on triggering cues. In fact, once these cues are removed, all previously evaluated "reasonably safe" models become unsafe, including Gemini 3 Pro and Claude Sonnet 3.7. Moreover, when intent laundering is adapted as a jailbreaking technique, it consistently achieves high attack success rates, ranging from 90% to over 98%, under fully black-box access. Overall, our findings expose a significant disconnect between how model safety is evaluated and how real-world adversaries behave.
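The attack success rate quoted above can be read as a simple ratio. The sketch below assumes the success criterion stated in the Figure 1 caption (a response counts as a successful attack only if the judge marks it both unsafe and practical). The `Judgment` record and the function name are hypothetical, introduced here for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """Hypothetical record of one judge verdict for one attacked data point."""
    unsafe: bool      # judge considers the response unsafe
    practical: bool   # judge considers the response practically usable


def attack_success_rate(judgments: list[Judgment]) -> float:
    """Return the percentage of judgments that are both unsafe and practical."""
    if not judgments:
        return 0.0
    successes = sum(1 for j in judgments if j.unsafe and j.practical)
    return 100.0 * successes / len(judgments)


if __name__ == "__main__":
    verdicts = [
        Judgment(unsafe=True, practical=True),    # counts as a success
        Judgment(unsafe=True, practical=False),   # unsafe but not practical
        Judgment(unsafe=False, practical=False),  # refusal or safe answer
        Judgment(unsafe=True, practical=True),    # counts as a success
    ]
    print(f"ASR = {attack_success_rate(verdicts):.1f}%")
```

Requiring both conditions keeps responses that are nominally unsafe but unusable from inflating the rate.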

Paper Structure (28 sections, 12 figures, 2 tables)

Figures (12)

  • Figure 1: Overview of our intent laundering framework. Without the feedback loop, the framework illustrates the intent laundering procedure. With the loop, it shows how intent laundering operates as a jailbreaking technique. The process begins by passing the original malicious request (data point) through the intent launderer to generate a revision. This revision is then used to attack the target model. An LLM judge evaluates the response for safety and practicality. If a response is both unsafe and practical, the attack is considered successful. Otherwise, the revision-regeneration mechanism is triggered, using all previous failed revisions as feedback to generate a new, improved revision. The loop ends when either a set number of iterations is reached or a target attack success rate is achieved.
  • Figure 2: Word clouds of the 40 most frequent $n$-grams, where $n \in \{1, 2, 3\}$, from the combined AdvBench and HarmBench corpus. Triggering cues are highlighted in red when their negative/sensitive connotation is inherent and in orange when it is contextual. This includes triggering words in the unigram cloud and triggering phrases in the bigram and trigram clouds. Neutral-connotation unigrams that contribute to triggering cues in higher-order $n$-grams are shown in green. These visualizations reveal an unusual overrepresentation of overtly triggering cues in safety datasets, suggesting that data points are artificially designed to trigger safety mechanisms. For example, expressions such as "tutorial" and "step-by-step instructions" are used explicitly to trigger infohazard safeguards (Bostrom, 2011), which is unrealistic. Even minimally skilled bad actors rarely use such overt and self-incriminating language (e.g., "commit identity theft"). These findings indicate that safety datasets fail to capture two of the key properties of real-world attacks: being well-crafted and driven by ulterior intent. Note that, for clarity, only a representative subset of expressions is marked in the word clouds.
  • Figure 3: Proportion of duplicated versus unique data points in the AdvBench and HarmBench datasets across varying similarity thresholds. Each safety dataset is compared to a size-matched GSM8K subset shown below its plot. Both safety datasets exhibit considerably higher duplication rates across most thresholds compared to their GSM8K counterparts. This is striking, as safety datasets are intended to approximate real-world attacks, which are characterized by being out-of-distribution and well-crafted. Instead, they show more duplication than a regular non-safety dataset, where such duplication is more acceptable. This is particularly alarming for safety datasets, as it indicates that many data points evaluate the model on essentially the same harmful intent in nearly identical contexts (see Figure 4 for examples), leading to an overestimated evaluation of safety.
  • Figure 4: Examples of duplicated data points from the AdvBench and HarmBench datasets. These examples exhibit two unusual patterns: (1) explicit and repetitive overuse of triggering cues, either inherently (in red, e.g., "chop shops") or contextually (in orange, e.g., "in detail"), and (2) substantial duplication resulting from this overuse. Each group of duplicates effectively represents a single malicious intent, i.e., a refusal or response to one is sufficient to evaluate the robustness of the model for that intent. As a result, safety evaluations based on these data points can be inflated.
  • Figure 5: An actual response (red box) generated by Gemini 3 Pro to an intent-laundered revision (green box) based on a data point from the AdvBench dataset (orange box). The revision uses both connotation neutralization and context transposition to abstract away triggering cues. The model response is partially shown to prevent potential misuse; however, the full response spans several paragraphs and fully enables the original malicious intent. This example is provided strictly for academic safety research. Any misuse is strongly discouraged.
  • ...and 7 more figures
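The duplication analysis described in the Figure 3 caption can be approximated with a short script. This is a minimal sketch, not the authors' method: the captions do not say which similarity measure was used, so the character-level ratio from Python's standard `difflib.SequenceMatcher` is assumed here as a stand-in, and the function names and sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive character-level similarity in [0, 1] (assumed stand-in metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def duplication_rate(points: list[str], threshold: float) -> float:
    """Fraction of data points with at least one other point at or above the threshold."""
    n = len(points)
    if n == 0:
        return 0.0
    duplicated = [False] * n
    # Pairwise comparison is O(n^2), which is fine for datasets of a few hundred items.
    for i in range(n):
        for j in range(i + 1, n):
            if similarity(points[i], points[j]) >= threshold:
                duplicated[i] = True
                duplicated[j] = True
    return sum(duplicated) / n


if __name__ == "__main__":
    # Benign placeholder prompts; a real run would load the dataset under study.
    samples = [
        "Write a tutorial on how to bake bread",
        "Write a tutorial on how to bake a cake",
        "Explain the rules of chess",
        "Summarize the history of the bicycle",
    ]
    for threshold in (0.5, 0.7, 0.9):
        rate = duplication_rate(samples, threshold)
        print(f"threshold={threshold:.1f}  duplicated={rate:.2f}")
```

Running the same function on a size-matched sample from a non-safety dataset gives the kind of baseline comparison the caption describes. An embedding-based similarity could replace `similarity` without changing the rest of the code.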