
Do Prompts Guarantee Safety? Mitigating Toxicity from LLM Generations through Subspace Intervention

Himanshu Singh, Ziwei Xu, A. V. Subramanyam, Mohan Kankanhalli

TL;DR

The paper tackles latent toxicity in LLMs, where safe prompts can still yield harmful outputs, by targeting internal representations rather than only outputs. It introduces gradient-based discovery of a toxicity subspace and an inference-time projection (feature-space editing) that steers hidden features away from toxic directions while preserving fluency; the subspace $P$ is constructed from the top-$k$ gradient directions. The authors compare feature-space alignment with weight editing theoretically, showing a restricted hypothesis class and locality, and report toxicity reductions of 8–20% across multiple models on RealToxicityPrompts with only modest perplexity increases, often enhancing utility when combined with detox baselines. They present the approach as practical for deployment and suggest future work on multi-modal tasks, adaptive interventions, and human-aligned toxicity definitions to further improve the safety and trustworthiness of LLMs.

Abstract

Large Language Models (LLMs) are powerful text generators, yet they can produce toxic or harmful content even when given seemingly harmless prompts. This presents a serious safety challenge and can cause real-world harm. Toxicity is often subtle and context-dependent, making it difficult to detect at the token level or through coarse sentence-level signals. Moreover, efforts to mitigate toxicity often face a trade-off between safety and the coherence or fluency of the generated text. In this work, we present a targeted subspace intervention strategy for identifying and suppressing hidden toxic patterns in the underlying model representations, while preserving the model's overall ability to generate safe, fluent content. On the RealToxicityPrompts benchmark, our method achieves strong mitigation performance compared to existing baselines, with minimal impact on inference complexity. Across multiple LLMs, our approach reduces the toxicity of state-of-the-art detoxification systems by 8-20%, while maintaining comparable fluency. Through extensive quantitative and qualitative analyses, we show that our approach achieves effective toxicity reduction without impairing generative performance, consistently outperforming existing baselines.
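
The subspace intervention described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: how gradients are collected, which layer is edited, and the exact update rule are not given here. The sketch assumes that the "top-$k$ gradient directions" are the leading right singular vectors of a matrix of stacked gradients, and that a scalar `beta` (borrowed from the $\beta$ swept in the figure captions) scales how much of the projection is removed. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def toxicity_subspace(grads: np.ndarray, k: int) -> np.ndarray:
    """Orthonormal (d, k) basis for the top-k directions of an (n, d) gradient matrix."""
    # Rows of vt are right singular vectors, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(grads, full_matrices=False)
    return vt[:k].T


def remove_toxic_projection(h: np.ndarray, basis: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Subtract beta times the component of h (shape (..., d)) that lies in span(basis)."""
    return h - beta * (h @ basis) @ basis.T


# Synthetic stand-in data (hypothetical): gradients concentrated in a 3-dim subspace.
rng = np.random.default_rng(0)
d, n, k = 16, 64, 3
planted = np.linalg.qr(rng.normal(size=(d, k)))[0]
grads = 5.0 * rng.normal(size=(n, k)) @ planted.T + 0.01 * rng.normal(size=(n, d))

basis = toxicity_subspace(grads, k)
h = rng.normal(size=(4, d))
h_edit = remove_toxic_projection(h, basis, beta=1.0)

# With beta = 1 the edited features have no component left in the subspace,
# and the change itself lies entirely inside the subspace.
residual_in_subspace = np.abs(h_edit @ basis).max()
delta = h - h_edit
change_outside_subspace = np.abs(delta - (delta @ basis) @ basis.T).max()
```

Under these assumptions the edit is local in a simple sense: the part of each hidden feature orthogonal to the subspace is left untouched, and `beta = 0` recovers the original features.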

Paper Structure (46 sections, 3 theorems, 16 equations, 9 figures, 6 tables)


Key Result

Proposition 4.1

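If for every $A\in\mathcal{A}$ the matrix $\Delta W = W_0 A$ lies in $\mathcal{D}$, then every feature-space edit is realized by some weight edit, so the feature-space hypothesis class is contained in the weight-editing hypothesis class. Because language models have $\mathrm{Vocab}\gg d$ (large vocabulary, smaller hidden size), the map $A\mapsto W_0 A$ is non-surjective, giving strict containment.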
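
The non-surjectivity of $A\mapsto W_0 A$ can be checked numerically on a toy example. The sketch below assumes shapes that the statement does not spell out: $W_0$ is $\mathrm{Vocab}\times d$, $A$ is $d\times d$, and the target space is all $\mathrm{Vocab}\times d$ updates (the paper's set $\mathcal{D}$ may be more constrained). The sizes and random matrices are hypothetical.

```python
import numpy as np

vocab, d = 12, 3  # toy sizes with vocab > d (hypothetical)
rng = np.random.default_rng(0)
W0 = rng.normal(size=(vocab, d))

# Column-major vectorization gives vec(W0 @ A) = kron(I_d, W0) @ vec(A).
M = np.kron(np.eye(d), W0)          # shape (vocab * d, d * d)
A = rng.normal(size=(d, d))
lhs = (W0 @ A).flatten(order="F")
rhs = M @ A.flatten(order="F")

rank = np.linalg.matrix_rank(M)     # dimension of the image of A -> W0 @ A
target_dim = vocab * d              # dimension of the space of all vocab x d updates

# A generic update is not of the form W0 @ A: least squares leaves a residual.
dW = rng.normal(size=(vocab, d))
A_fit, *_ = np.linalg.lstsq(W0, dW, rcond=None)
gap = np.linalg.norm(W0 @ A_fit - dW)
```

The image has dimension at most $d^2$, while the space of $\mathrm{Vocab}\times d$ updates has dimension $\mathrm{Vocab}\cdot d$, so whenever $\mathrm{Vocab} > d$ some weight updates cannot be written as $W_0 A$.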

Figures (9)

  • Figure 1: Illustration of LLM behavior on different prompts from RealToxicityPrompts (Gehman et al., 2020). Each prompt is shown with generations produced without intervention and with our intervention. Toxic words are partially masked with *.
  • Figure 2: Effect of removing the toxic projection from a hidden feature.
  • Figure 3: Mean toxicity (blue circles) and perplexity (orange squares) at each $\beta$, averaged over all layers; shaded bands show $\pm1$ std across layers.
  • Figure 4: Mean toxicity (blue circles) and perplexity (orange squares) at each layer, averaged over all $\beta$; shaded bands show $\pm1$ std across $\beta$.
  • Figure 5: Utility Task Graphs.
  • ...and 4 more figures

Theorems & Definitions (3)

  • Proposition 4.1: Structural containment under linear readout
  • Lemma 4.2: Locality of projection-based feature updates
  • Corollary 7.1: Feature-space alignment yields tighter generalization bounds