Whispering Experts: Neural Interventions for Toxicity Mitigation in Language Models

Xavier Suau; Pieter Delobelle; Katherine Metcalf; Armand Joulin; Nicholas Apostoloff; Luca Zappella; Pau Rodríguez

TL;DR

This work tackles toxicity in large language models by identifying toxicity-associated expert neurons and damping their activations. It introduces AurA, a hyperparameter-free method that scales each expert's influence according to its AUROC-derived toxicity expertise, avoiding extensive hyperparameter tuning. Empirical results show AurA achieves up to 2.2x toxicity reduction with only a small perplexity cost, across models ranging from 1.5B to 40B parameters, and it complements pre-prompting to further reduce toxicity, including under adversarial prompts. AurA also preserves zero-shot common-sense reasoning and shifts toxic data modes to out-of-distribution, offering a practical and scalable safety enhancement for deploying safer LLMs.

Abstract

An important issue with Large Language Models (LLMs) is their undesired ability to generate toxic language. In this work, we show that the neurons responsible for toxicity can be determined by their power to discriminate toxic sentences, and that toxic language can be mitigated by reducing their activation levels proportionally to this power. We propose AUROC adaptation (AurA), an intervention that can be applied to any pre-trained LLM to mitigate toxicity. As the intervention is proportional to the ability of each neuron to discriminate toxic content, it is free of any model-dependent hyperparameters. We show that AurA can achieve up to $2.2 \times$ reduction in toxicity with only a $0.72$ perplexity increase. We also show that AurA is effective with models of different scale (from 1.5B to 40B parameters), and its effectiveness in mitigating toxic language, while preserving common-sense zero-shot abilities, holds across all scales. AurA can be combined with pre-prompting strategies, boosting its average mitigation potential from $1.28\times$ to $2.35\times$. Moreover, AurA can counteract adversarial pre-prompts that maliciously elicit toxic content, making it an effective method for deploying safer and less toxic models.
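
The core idea in the abstract, scoring each neuron by how well its activation separates toxic from non-toxic sentences and then dampening it in proportion to that score, can be sketched as follows. This is a minimal illustration, not the authors' implementation. The mapping from AUROC to a damping factor used here (`1 - 2 * (AUROC - 0.5)`, clipped so that neurons at or below chance are left untouched) is an assumption chosen to make "proportional to discriminative power" concrete, and the function names `auroc` and `damping_factors` are hypothetical.

```python
import numpy as np


def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney U) of `scores` against binary `labels`.

    Tied scores receive their average rank.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n = len(scores)
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(n, dtype=float)
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied scores starting at i.
        while j + 1 < n and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0  # average 1-based rank
        i = j + 1
    n_pos = int(labels.sum())
    n_neg = n - n_pos
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


def damping_factors(activations, labels):
    """Per-neuron multiplicative factors in [0, 1].

    activations: array of shape (n_sentences, n_neurons), one summary
                 activation per sentence and neuron (an assumed layout).
    labels:      1 for toxic sentences, 0 for non-toxic ones.
    Assumption: factor = 1 - 2 * (AUROC - 0.5) above chance, else 1.
    """
    activations = np.asarray(activations, dtype=float)
    aurocs = np.array(
        [auroc(activations[:, m], labels) for m in range(activations.shape[1])]
    )
    expertise = np.clip(2.0 * (aurocs - 0.5), 0.0, 1.0)
    return 1.0 - expertise


# Toy data: neuron 0 separates the classes perfectly, neuron 1 is
# anti-correlated with toxicity, neuron 2 is constant (chance level).
toy_activations = np.array([
    [0.1, 0.9, 0.5],
    [0.2, 0.8, 0.5],
    [0.8, 0.2, 0.5],
    [0.9, 0.1, 0.5],
])
toy_labels = np.array([0, 0, 1, 1])
factors = damping_factors(toy_activations, toy_labels)

# At inference time, the factors would multiply the neuron outputs.
damped = toy_activations * factors
```

Under this assumed mapping, no threshold or expert count has to be chosen: the strength of the intervention on each neuron follows directly from its AUROC, which matches the abstract's claim that the method has no model-dependent hyperparameters.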

Paper Structure (26 sections, 3 equations, 10 figures, 10 tables, 4 algorithms)

Introduction
Revisiting self-conditioning LLMs
Whispering Experts
AurA
Experimental Results
LLMs with AurA show less toxicity
Interaction with Pre-prompting
The Effect of AurA on Common-Sense Reasoning
AurA Shifts Toxic Data Modes to OOD
Ablation Study
Related Work
Limitations and Future Work
Conclusion
Algorithms
Pareto Fronts of Toxicity vs. PPL$_{WIK}$ for Different Models
...and 11 more sections

Figures (10)

Figure 1: AurA mitigates toxicity with a small impact on perplexity. (Top) Neurons with high toxicity expertise are dampened more strongly, yielding a less toxic LLM. (Middle) We show the toxicity reduction between the original model (circles) and the model with our AurA intervention (stars), for different LLMs. PPL stands for Perplexity and RTP refers to the Real Toxicity Prompts dataset. (Bottom) Results when pre-prompting Falcon-7B-instruct with a pre-prompt that induces toxicity. AurA mitigates toxicity even when the pre-prompt is adversarial.
Figure 2: Pareto front of RTP toxicity vs. perplexity on Wikipedia for the MPT-7B model. (Top) Search for $\alpha$ in Damp; we observe an optimal value at $\alpha=0.5$. (Bottom) $\text{Det}_\text{zero}$ and Damp with $\alpha=0.5$ (best $\alpha$ found) for different $k$, shown next to dots. In gray, Damp with an intervention on random sets of experts (5 runs). For reference, we include our non-parametric method AurA, detailed in the Whispering Experts section.
Figure 3: When combined with pre-prompting, AurA exhibits a significantly positive impact. We show RTP toxicity using Falcon-7B-instruct when pre-prompting the model with different favorable (Non-toxic) or adversarial (Toxic) pre-prompts. AurA mitigates toxicity in all scenarios, by $2.35\times$ on average, shown as the difference between circles (without AurA) and stars (with AurA). Our method is robust even when facing extremely adversarial pre-prompts. The gray circle corresponds to the original model without a pre-prompt.
Figure 4: Impact of AurA on perplexity. We measure the perplexity change on non-toxic (blue) and toxic (red) corpora. The perplexity remains low and unchanged for non-toxic corpora (a mean increase of $+1.39$) and strongly increases for toxic ones (a median increase of $+193.46$). This indicates that AurA reduces the likelihood of toxic data modes.
Figure 5: Pareto fronts of toxicity vs. perplexity when sweeping $k$ (shown next to dots) for $\text{Det}_\text{zero}$ and Damp (for an optimal $\alpha=0.5$), and the DExperts parameter in the GPT2-XL panel, for different models and methods. The dots with a black border show the model performance with no conditioning (i.e., $k=0$ for $\text{Det}_\text{zero}$ and Damp, and a DExperts parameter equal to 0).
...and 5 more figures
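
The captions of Figures 2 and 5 compare two baselines that intervene on the top $k$ experts: Damp, which depends on a factor $\alpha$, and $\text{Det}_\text{zero}$. A rough sketch of such a top-$k$ intervention is given below. It assumes that Damp multiplies the selected activations by $\alpha$ and that $\text{Det}_\text{zero}$ sets them to zero (the $\alpha=0$ case); neither definition is spelled out in this section. The function name `intervene_top_k` and the `expertise` vector are hypothetical.

```python
import numpy as np


def intervene_top_k(activations, expertise, k, alpha):
    """Scale the k neurons with the highest expertise by `alpha`.

    activations: array whose last axis indexes neurons.
    expertise:   one score per neuron (higher means more toxicity-related).
    k:           number of experts to intervene on; k=0 leaves the model as is.
    alpha:       multiplicative factor; alpha=0 zeroes the selected neurons.
    """
    scaled = np.asarray(activations, dtype=float).copy()
    if k > 0:
        top_k = np.argsort(expertise)[::-1][:k]  # indices of the k largest scores
        scaled[..., top_k] *= alpha
    return scaled


acts = np.array([1.0, 2.0, 3.0, 4.0])
scores = np.array([0.90, 0.60, 0.95, 0.50])

damp_out = intervene_top_k(acts, scores, k=2, alpha=0.5)  # assumed Damp behavior
zero_out = intervene_top_k(acts, scores, k=2, alpha=0.0)  # assumed Det_zero behavior
no_op = intervene_top_k(acts, scores, k=0, alpha=0.5)     # no conditioning
```

Both variants require choosing $k$ (and $\alpha$ for Damp), which is what the Pareto fronts sweep over; AurA is presented as the alternative that avoids this search.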
