Do Prompts Guarantee Safety? Mitigating Toxicity from LLM Generations through Subspace Intervention

Himanshu Singh; Ziwei Xu; A. V. Subramanyam; Mohan Kankanhalli

TL;DR

The paper tackles latent toxicity in LLMs, where benign prompts can still yield harmful outputs, by intervening on internal representations rather than only on outputs. It introduces gradient-based toxicity subspace discovery and an inference-time projection (feature-space editing) that steers hidden states away from toxic directions while preserving fluency; the subspace $P$ is constructed from the top-$k$ gradient directions. The authors compare feature-space alignment with weight editing theoretically, showing that feature-space alignment has a restricted hypothesis class and local effects. Empirically, they report toxicity reductions of 8–20% across multiple models on RealToxicityPrompts with only modest perplexity increases, often improving utility when combined with detoxification baselines. The approach is practical for deployment; suggested future work includes extending it to multimodal tasks, adaptive interventions, and human-aligned toxicity definitions to further improve the safety and trustworthiness of LLMs.
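The subspace construction and projection described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: it assumes the top-$k$ directions are obtained by an SVD of stacked gradients, that the strength parameter `beta` scales the removed component, and it uses synthetic gradients; the names `toxicity_subspace` and `steer` are hypothetical.

```python
import numpy as np

def toxicity_subspace(grads: np.ndarray, k: int) -> np.ndarray:
    """Return an orthonormal basis (d x k) for the top-k gradient directions.

    Assumption: each row of `grads` is the gradient of some toxicity score
    with respect to a hidden state; the top-k right singular vectors are
    taken as the "toxic" directions.
    """
    _, _, vt = np.linalg.svd(grads, full_matrices=False)
    return vt[:k].T

def steer(h: np.ndarray, basis: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Subtract beta times the projection of h onto the subspace."""
    return h - beta * (basis @ (basis.T @ h))

# Synthetic data (assumption): gradients concentrated in a 3-dim subspace.
rng = np.random.default_rng(0)
d, n, k = 16, 200, 3
true_dirs = np.linalg.qr(rng.normal(size=(d, k)))[0]
grads = rng.normal(size=(n, k)) @ true_dirs.T + 0.01 * rng.normal(size=(n, d))

U = toxicity_subspace(grads, k)      # d x k orthonormal basis
h = rng.normal(size=d)               # a stand-in hidden state
h_safe = steer(h, U, beta=1.0)       # component inside the subspace removed
```

With `beta=1.0` the steered state has no component left in the estimated subspace, while everything orthogonal to it is untouched; smaller values of `beta` remove only part of that component.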

Abstract

Large Language Models (LLMs) are powerful text generators, yet they can produce toxic or harmful content even when given seemingly harmless prompts. This presents a serious safety challenge and can cause real-world harm. Toxicity is often subtle and context-dependent, making it difficult to detect at the token level or through coarse sentence-level signals. Moreover, efforts to mitigate toxicity often face a trade-off between safety and the coherence or fluency of the generated text. In this work, we present a targeted subspace intervention strategy for identifying and suppressing hidden toxic patterns in the underlying model representations, while preserving the model's overall ability to generate safe, fluent content. On the RealToxicityPrompts benchmark, our method achieves strong mitigation performance compared to existing baselines, with minimal impact on inference complexity. Across multiple LLMs, our approach reduces the toxicity of state-of-the-art detoxification systems by 8-20%, while maintaining comparable fluency. Through extensive quantitative and qualitative analyses, we show that our approach achieves effective toxicity reduction without impairing generative performance, consistently outperforming existing baselines.

Paper Structure (46 sections, 3 theorems, 16 equations, 9 figures, 6 tables)

Introduction
Related Work
Output-Level Toxicity Mitigation.
Tuning-Based Alignment Methods.
Mechanistic and Editing-Based Approaches.
Methodology
Collecting Toxic Continuations
Hidden State Extraction and Toxicity Annotation
Hidden state collection.
Token-level toxicity attribution.
Gradient-Based Toxicity Subspace Discovery
Inference-Time Toxicity Steering
Theoretical Insights: Feature Space Alignment vs Weight Editing
Preliminaries
Head-space editing.
...and 31 more sections

Key Result

Proposition 4.1

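If for every $A\in\mathcal{A}$ the matrix $\Delta W = W_0 A$ lies in $\mathcal{D}$, then every model reachable by a feature-space edit is also reachable by a weight edit, so the feature-space hypothesis class is contained in the weight-editing class. Because language models have $\text{Vocab}\gg d$ (large vocabulary, smaller hidden size), the map $A\mapsto W_0 A$ is non-surjective, so the containment is strict.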
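The non-surjectivity claim can be checked numerically on a toy example. The sketch below assumes $W_0$ is a readout matrix of shape $\text{Vocab}\times d$ and $A$ is a $d\times d$ feature-space edit (shapes not stated in this summary), with toy sizes and random matrices chosen purely for illustration; `residual_after_best_A` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 50, 4                      # toy sizes with vocab >> d (assumption)
W0 = rng.normal(size=(vocab, d))      # stand-in for the readout matrix

def residual_after_best_A(delta_W: np.ndarray) -> float:
    """Distance from delta_W to the closest matrix of the form W0 @ A."""
    A_hat, *_ = np.linalg.lstsq(W0, delta_W, rcond=None)
    return float(np.linalg.norm(W0 @ A_hat - delta_W))

# A weight update induced by a feature-space edit is reproduced exactly.
A = rng.normal(size=(d, d))
res_induced = residual_after_best_A(W0 @ A)

# A generic weight update is not of the form W0 @ A for any A.
res_generic = residual_after_best_A(rng.normal(size=(vocab, d)))

# Degrees of freedom: d*d for feature edits versus vocab*d for weight edits.
dims = (d * d, vocab * d)
```

Here `res_induced` is zero up to floating-point error while `res_generic` is far from zero, matching the dimension count: feature-space edits have $d^2$ free parameters, whereas arbitrary updates to the readout have $\text{Vocab}\cdot d$.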

Figures (9)

Figure 1: Illustration of LLM behavior on different prompts from RealToxicityPrompts (Gehman et al., 2020). Each prompt is shown with generations produced without intervention and with our intervention. Toxic words are partially masked with *.
Figure 2: Effect of removing toxic projection from hidden feature.
Figure 3: Mean toxicity (blue circles) and perplexity (orange squares) at each $\beta$, averaged over all layers; shaded bands show $\pm1$ std across layers.
Figure 4: Mean toxicity (blue circles) and perplexity (orange squares) at each layer, averaged over all $\beta$; shaded bands show $\pm1$ std across $\beta$.
Figure 5: Utility Task Graphs.
...and 4 more figures

Theorems & Definitions (4)

Proposition 4.1: Structural containment under linear readout
proof
Lemma 4.2: Locality of projection-based feature updates
Corollary 7.1: Feature-space alignment yields tighter generalization bounds
