A Generative Approach to LLM Harmfulness Mitigation with Red Flag Tokens

David Dobre, Mehrnaz Mofakhami, Sophie Xhonneux, Leo Schwinn, Gauthier Gidel

TL;DR

The paper addresses the brittleness of post-training safety methods for LLMs by proposing a model-native red flag token that the model emits when harmful content is generated or imminent. The token is trained with a three-term loss that preserves utility while signalling harmfulness during generation, which enables both hard filtering and soft, reflective safety reasoning via in-context learning. The approach is reported to be robust to adversarial attacks, to preserve task performance across multiple models, and to generalize to languages beyond the training data, including Japanese in some cases. It is complementary and orthogonal to existing classifiers and safety fine-tuning, offering a practical path toward safety that adapts as LLM capabilities grow.

Abstract

Many safety post-training methods for large language models (LLMs) are designed to modify the model's behaviour from producing unsafe answers to issuing refusals. However, such distribution shifts are often brittle and degrade performance on desirable tasks. To address these pitfalls, we propose augmenting the model's vocabulary with a special red flag token, and training the model to insert this token whenever harmful content is generated or imminent. This approach enables the model to explicitly learn the concept of harmfulness in its representations, with minimal impact on utility due to the marginal change in the generated distribution of natural language. Moreover, because the token is embedded in the model's vocabulary, we can naturally leverage the LLMs' generalization capabilities, such as in-context learning (ICL) and out-of-distribution generalization to languages that are not formally supported (e.g., Japanese for Llama3). In particular, we demonstrate that through ICL alone, the model can learn to initiate reflective reasoning upon generating the red flag token at inference, which steers the response away from harmful continuations or enables self-correction when the flag is raised falsely. This approach is orthogonal and complementary to existing safety techniques (such as safety classifiers or standard safety training) and easier to evaluate than natural language refusals, as it does not require a human or automated judge to assess the harmlessness of the answers.
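
The hard-filtering use of the red flag token described above can be illustrated with a minimal sketch. It is not the authors' implementation: the token string `<rf>`, the fallback message, and the `hard_filter` helper are all hypothetical names chosen for illustration, and the response is assumed to arrive as a list of already-decoded token strings.

```python
from dataclasses import dataclass
from typing import Sequence

RF_TOKEN = "<rf>"  # hypothetical string form of the red flag token
FALLBACK = "I can't help with that."  # hypothetical replacement message


@dataclass
class FilterResult:
    flagged: bool  # True if the red flag token was generated
    text: str      # text that would be shown to the user


def hard_filter(tokens: Sequence[str],
                rf_token: str = RF_TOKEN,
                fallback: str = FALLBACK) -> FilterResult:
    """Pass the response through unless the red flag token appears in it."""
    if rf_token in tokens:
        # Discard the whole response: text before the flag may already be harmful.
        return FilterResult(flagged=True, text=fallback)
    return FilterResult(flagged=False, text="".join(tokens))


benign = hard_filter(["Sure", ", ", "here", " it", " is"])
unsafe = hard_filter(["Step", " 1", ":", RF_TOKEN, " mix"])
```

Because the check is a simple membership test on generated tokens, no human or automated judge is needed to decide whether the defence triggered, which matches the evaluation advantage the abstract claims.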

Paper Structure

This paper contains 30 sections, 5 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Illustration of using a $\color{BrickRed}\langle\texttt{rf}\rangle$ token for filtering harmfulness or reasoning about safety.
  • Figure 2: The loss terms on harmful continuations: $\color{BrickRed}\langle\texttt{rf}\rangle$ is inserted at a random position $i$; language modelling cross-entropy is used to generate a $\color{BrickRed}\langle\texttt{rf}\rangle$ at all positions up to $i$, and we use a KL divergence to ensure that the model distribution is unaffected after a $\color{BrickRed}\langle\texttt{rf}\rangle$.
  • Figure 3: Model evaluation of the robustness-utility trade-off. The left represents utility benchmarks (higher is better), and the right represents adversarial defence success rates (higher is better). Both refusal and $\color{BrickRed}\langle\texttt{rf}\rangle$ generation are considered a successful defence. Refusals are judged by GPT-5.
  • Figure 4: $\color{BrickRed}\langle\texttt{rf}\rangle$ as a soft signal to trigger safety-focused Chain-of-Thought (CoT) reasoning: Evaluating performance of the RF model with CoT prompting against Llama3.2-3B-IT Base, CAT and RF without CoT prompting on three metrics: Defence Success Rate (DSR) under Prefilling attack on Harmbench and utility on Harmless and XSTest-Safe-Subset. Higher is better.
  • Figure 5: Samples of ICL $\color{BrickRed}\langle\texttt{rf}\rangle$-triggered safety reflections in benign (top) and unsafe (bottom) requests.
  • ...and 1 more figure
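
The two loss terms named in the Figure 2 caption can be written out as a rough sketch. The notation is an assumption, not taken from the paper: $x$ is the prompt, $y = (y_1, \dots, y_T)$ a harmful continuation, $i$ the sampled insertion position, $p_\theta$ the trained model, and $p_{\text{ref}}$ a frozen copy of the original model. The direction of the KL divergence and any weighting between terms are also assumptions, and the third term of the three-term loss is not described in this excerpt, so it is omitted.

$$
\mathcal{L}_{\text{CE}} = -\sum_{j=1}^{i} \log p_\theta\big(\langle\texttt{rf}\rangle \mid x, y_{<j}\big)
$$

$$
\mathcal{L}_{\text{KL}} = \sum_{j=i}^{T} D_{\text{KL}}\Big(p_{\text{ref}}\big(\cdot \mid x, y_{<j}\big) \,\Big\|\, p_\theta\big(\cdot \mid x, y_{<i}, \langle\texttt{rf}\rangle, y_{i:j-1}\big)\Big)
$$

Read this way, the first term rewards raising the flag at any point up to position $i$, while the second keeps the next-token distribution after the flag close to what the original model would have produced without it.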