Do Prompts Guarantee Safety? Mitigating Toxicity from LLM Generations through Subspace Intervention
Himanshu Singh, Ziwei Xu, A. V. Subramanyam, Mohan Kankanhalli
TL;DR
The paper tackles latent toxicity in LLMs, where seemingly safe prompts still yield harmful outputs, by intervening on internal representations rather than only on outputs. It introduces gradient-based toxicity subspace discovery and an inference-time projection (feature-space editing) that steers representations away from toxic directions while preserving fluency; the intervention is formalized via a subspace $P$ constructed from the top-$k$ gradient directions. The authors provide theoretical insights comparing feature-space alignment with weight editing, showing a restricted hypothesis class and locality. Empirically, on RealToxicityPrompts the method reduces the toxicity of state-of-the-art detoxification systems by 8–20% across multiple models with only modest perplexity increases, often enhancing utility when combined with detox baselines. The approach is practical for deployment; suggested future work includes extending it to multi-modal tasks, adaptive interventions, and human-aligned toxicity definitions to further improve the safety and trustworthiness of LLMs.
Abstract
Large Language Models (LLMs) are powerful text generators, yet they can produce toxic or harmful content even when given seemingly harmless prompts. This presents a serious safety challenge and can cause real-world harm. Toxicity is often subtle and context-dependent, making it difficult to detect at the token level or through coarse sentence-level signals. Moreover, efforts to mitigate toxicity often face a trade-off between safety and the coherence or fluency of the generated text. In this work, we present a targeted subspace intervention strategy for identifying and suppressing hidden toxic patterns in the underlying model representations, while preserving the model's overall ability to generate safe, fluent content. On the RealToxicityPrompts benchmark, our method achieves strong mitigation performance compared to existing baselines, with minimal impact on inference complexity. Across multiple LLMs, our approach reduces the toxicity of state-of-the-art detoxification systems by 8–20%, while maintaining comparable fluency. Through extensive quantitative and qualitative analyses, we show that our approach achieves effective toxicity reduction without impairing generative performance, consistently outperforming existing baselines.
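
The subspace intervention summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the "top-$k$ gradient directions" are the leading right singular vectors of a matrix of toxicity-loss gradients, and that the inference-time edit is an orthogonal projection of a hidden state onto the complement of that subspace. The function names `toxicity_subspace` and `project_out`, the strength parameter `alpha`, and the random stand-in data are all hypothetical.

```python
import numpy as np


def toxicity_subspace(grads: np.ndarray, k: int) -> np.ndarray:
    """Return an orthonormal basis of shape (d, k) for the top-k directions.

    grads: (n, d) matrix, one gradient per row (assumed to come from some
    toxicity loss with respect to a hidden representation).
    Assumption: "top-k" means the k leading right singular vectors.
    """
    # Rows of vt are right singular vectors, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(grads, full_matrices=False)
    return vt[:k].T


def project_out(h: np.ndarray, basis: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Remove (alpha=1) or dampen (0 < alpha < 1) the part of h inside span(basis).

    h: (..., d) hidden states; basis: (d, k) with orthonormal columns.
    """
    coeffs = h @ basis                      # coordinates of h in the subspace
    return h - alpha * (coeffs @ basis.T)   # subtract the in-subspace component


# Random stand-ins for real gradients and hidden states (illustration only).
rng = np.random.default_rng(0)
grads = rng.normal(size=(32, 16))    # 32 gradient samples, hidden size 16
hidden = rng.normal(size=(5, 16))    # 5 hidden states to edit

basis = toxicity_subspace(grads, k=4)
edited = project_out(hidden, basis)
```

With `alpha=1.0` the edited states have no component left in the estimated subspace, and applying the projection a second time changes nothing. In a real model the edit would be applied to activations at a chosen layer during generation; which layer and which loss to use are details this sketch leaves open.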
