GloSS over Toxicity: Understanding and Mitigating Toxicity in LLMs via Global Toxic Subspace

Zenghao Duan, Zhiyi Yin, Zhichao Shi, Liang Pang, Shaoling Jing, Jiayi Wu, Yu Yan, Huawei Shen, Xueqi Cheng

TL;DR

This work reframes toxicity in LLMs as a global, low-dimensional subspace problem within FFNs, challenging the view that toxic outputs arise from isolated toxic vectors or layer-specific directions. It introduces GloSS, a four-stage, training-free detoxification method that identifies a global toxic subspace via SVD and PCA across layers and then removes toxicity by projecting FFN value matrices onto the orthogonal complement of this subspace. Across multiple open-source LLMs, GloSS achieves strong detoxification with minimal impact on general language abilities, outperforming SSFT, DPO, and ProFS while using far fewer toxic training samples. The findings point to a compact toxic structure and offer a practical, data-efficient safeguard for deploying safer LLMs without retraining: the intervention disrupts toxic directions without erasing broad linguistic capabilities.

Abstract

This paper investigates the underlying mechanisms of toxicity generation in Large Language Models (LLMs) and proposes an effective detoxification approach. Prior work typically considers the Feed-Forward Network (FFN) as the main source of toxicity, representing toxic regions as a set of toxic vectors or layer-wise subspaces. However, our in-depth analysis reveals that the global toxic subspace offers a more effective and comprehensive representation of the toxic region within the model. Building on this insight, we propose GloSS (Global Toxic Subspace Suppression), a lightweight, four-stage method that mitigates toxicity by identifying and removing the global toxic subspace from the FFN parameters. Experiments across a range of LLMs show that GloSS achieves state-of-the-art detoxification performance while preserving the model's general capabilities, without requiring large-scale data or model retraining.
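
The core idea summarized above (find a shared toxic subspace across layers, then project it out of the FFN output weights) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the helper names (`top_directions`, `global_subspace`, `project_out`), the use of toxic-minus-non-toxic activation differences as input, the uncentered SVD standing in for PCA, and the `(d_model, d_ff)` weight layout are all assumptions made for illustration, and the data is synthetic.

```python
import numpy as np


def top_directions(diffs: np.ndarray, k: int) -> np.ndarray:
    """Top-k right singular vectors of a (n_samples, d_model) matrix.

    `diffs` is assumed to hold per-sample differences between toxic and
    non-toxic activations for one layer (hypothetical input format).
    """
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k]  # shape (k, d_model), rows are orthonormal


def global_subspace(layer_dirs: list, r: int) -> np.ndarray:
    """Orthonormal basis (r, d_model) shared across layers.

    Stacks the per-layer candidate directions and keeps the top-r right
    singular vectors (an uncentered PCA, chosen here because the sign of
    each candidate direction is arbitrary).
    """
    stacked = np.vstack(layer_dirs)
    _, _, vt = np.linalg.svd(stacked, full_matrices=False)
    return vt[:r]


def project_out(w_proj: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Project the columns of w_proj onto the orthogonal complement of basis.

    Assumes w_proj has shape (d_model, d_ff), so each column is one value
    vector living in the residual-stream space.
    """
    projector = basis.T @ basis  # (d_model, d_model), projects onto the subspace
    return w_proj - projector @ w_proj


# Synthetic demo: one planted "toxic" direction shared by every layer.
rng = np.random.default_rng(0)
d_model, d_ff, n_layers, n_pairs = 16, 32, 4, 50
toxic = rng.normal(size=d_model)
toxic /= np.linalg.norm(toxic)

layer_dirs = []
for _ in range(n_layers):
    coeffs = rng.normal(loc=3.0, scale=1.0, size=(n_pairs, 1))
    diffs = coeffs * toxic + 0.01 * rng.normal(size=(n_pairs, d_model))
    layer_dirs.append(top_directions(diffs, k=1))

basis = global_subspace(layer_dirs, r=1)
w_proj = rng.normal(size=(d_model, d_ff))
w_clean = project_out(w_proj, basis)

residual = np.abs(basis @ w_clean).max()
print(f"max component left along the recovered subspace: {residual:.2e}")
```

Because the edit is a single matrix projection applied to the weights, it requires no gradient updates, which is consistent with the training-free framing above. In a real model the weight layout may be transposed (value vectors stored as rows), in which case the projection would be applied on the other side.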

Paper Structure

This paper contains 17 sections, 13 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: (a) Removing toxic vectors does not alter the underlying toxic subspace. (b) Layer-wise subspaces are limited and fail to capture complete toxic features. (c) The global toxic subspace provides a more faithful representation of the toxic region.
  • Figure 2: Results of Different Operations on Value Vector Activations. (a) Selectively enhancing different numbers of toxic and non-toxic value vector activations; (b) Suppressing toxic vector activations at different proportions; (c) Reversing value vector activations, which steers the FFN blocks either toward or away from the toxic direction.
  • Figure 3: Top-5 Toxic Directions Across Layers. They are primarily located in the middle-to-late layers and exhibit pairwise cosine similarities close to 1.
  • Figure 4: Overview of GloSS. It identifies and removes the global toxic subspace through a four-stage procedure to effectively reduce toxic generation. The intervention is applied by modifying $W_{proj}$ in the FFN modules.
  • Figure 5: Effectiveness of Extracted vs. Random Subspaces in Toxicity Reduction. Noop denotes the original model without any modification.
  • ...and 1 more figure