A Generative Approach to LLM Harmfulness Mitigation with Red Flag Tokens
David Dobre, Mehrnaz Mofakhami, Sophie Xhonneux, Leo Schwinn, Gauthier Gidel
TL;DR
The paper addresses the brittleness of post-training safety methods for LLMs by proposing a model-native red flag token that the model emits when harmful content is generated or imminent. The token is trained with a three-term loss that preserves overall utility while signaling harmfulness through the generation process, enabling both hard filtering and soft, reflective safety reasoning via in-context learning. The approach demonstrates strong robustness to adversarial attacks, preserves task performance across multiple models, and generalizes to languages beyond the training data, including Japanese in some cases. It offers a complementary, orthogonal safeguard to existing classifiers and safety fine-tuning, pointing to a practical path toward scalable, adaptable safety as LLM capabilities grow.
Abstract
Many safety post-training methods for large language models (LLMs) are designed to modify the model's behaviour from producing unsafe answers to issuing refusals. However, such distribution shifts are often brittle and degrade performance on desirable tasks. To address these pitfalls, we propose augmenting the model's vocabulary with a special red flag token, and training the model to insert this token whenever harmful content is generated or imminent. This approach enables the model to explicitly learn the concept of harmfulness in its representations, with minimal impact on utility due to the marginal change in the generated distribution of natural language. Moreover, because the token is embedded in the model's vocabulary, we can naturally leverage the LLM's generalization capabilities, such as in-context learning (ICL) and out-of-distribution generalization to languages that are not formally supported (e.g., Japanese for Llama3). In particular, we demonstrate that through ICL alone, the model can learn to initiate reflective reasoning upon generating the red flag token at inference, which steers the response away from harmful continuations or enables self-correction when the flag is raised falsely. This approach is orthogonal and complementary to existing safety techniques (such as safety classifiers or standard safety training) and is easier to evaluate than natural language refusals, as it does not require a human or automated judge to assess the harmlessness of the answers.
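The hard-filtering use of the red flag token can be illustrated with a minimal sketch. It is not the authors' implementation: the token id `RED_FLAG_ID`, the fallback message `REFUSAL_TEXT`, and the helpers `hard_filter` and `respond` are hypothetical names introduced here, and the generated sequence is assumed to be available as a plain iterable of integer token ids. The sketch only shows why the flag is easy to act on and to evaluate: checking for it is an exact token comparison, with no judge needed.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical values chosen for illustration only; the real id depends on
# where the red flag token is added to a given model's vocabulary.
RED_FLAG_ID = 128_256
REFUSAL_TEXT = "I can't help with that."


def hard_filter(
    token_ids: Iterable[int], red_flag_id: int = RED_FLAG_ID
) -> Tuple[List[int], bool]:
    """Collect generated token ids until the red flag token appears.

    Returns the tokens emitted before the flag and whether it was raised.
    """
    kept: List[int] = []
    for tok in token_ids:
        if tok == red_flag_id:
            # Stop consuming the stream as soon as the flag is emitted.
            return kept, True
        kept.append(tok)
    return kept, False


def respond(
    token_ids: Iterable[int],
    decode: Callable[[List[int]], str],
    red_flag_id: int = RED_FLAG_ID,
) -> str:
    """Return the decoded answer, or a fixed refusal if the flag was raised."""
    kept, flagged = hard_filter(token_ids, red_flag_id)
    return REFUSAL_TEXT if flagged else decode(kept)
```

Because `hard_filter` accepts any iterable, it could in principle wrap a streaming generator so that decoding stops at the flag. The soft, reflective variant described in the abstract would instead keep generating after the flag and is not covered by this sketch.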
