IF-GUIDE: Influence Function-Guided Detoxification of LLMs

Zachary Coalson, Juhan Bae, Nicholas Carlini, Sanghyun Hong

TL;DR

IF-GUIDE introduces a proactive toxicity mitigation technique that prevents toxic knowledge from being learned by LLMs. It combines token-level influence-function attribution, differential toxicity signals, and a penalty-based training objective, and uses EK-FAC and proxy models to scale to large models. The approach achieves substantial reductions in both explicit and implicit toxicity, outperforming filtering and post-hoc alignment methods while maintaining fluency, and remains effective when applied during pre-training or fine-tuning. The work includes mechanistic analyses and adversarial robustness tests, showing that toxicity suppression is achieved without encoding toxicity in internal representations and that the method can be combined with existing defenses for stronger safety.

Abstract

We study how training data contributes to the emergence of toxic behaviors in large language models. Most prior work on reducing model toxicity adopts reactive approaches, such as fine-tuning pre-trained (and potentially toxic) models to align them with human values. In contrast, we propose a proactive approach, IF-GUIDE, that leverages influence functions to identify and suppress harmful tokens in the training data. To this end, we first show that standard influence functions are ineffective at discovering harmful training records. We then present a novel adaptation that measures token-level attributions from training data to model toxicity, along with techniques for selecting toxic training documents and a learning objective that can be integrated into both pre-training and fine-tuning. Moreover, IF-GUIDE does not rely on human-preference data, which is typically required by existing alignment methods. In our evaluation, we demonstrate that IF-GUIDE substantially reduces both explicit and implicit toxicity, by up to 10$\times$ compared to uncensored models and up to 3$\times$ compared to baseline alignment methods such as DPO and RAD, across both pre-training and fine-tuning scenarios. IF-GUIDE is computationally efficient: a billion-parameter model is not necessary for computing influence scores; a million-parameter model (with 7.5$\times$ fewer parameters) can effectively serve as a proxy for identifying harmful data. Our code is publicly available at: https://github.com/ztcoalson/IF-Guide
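
The idea of selecting toxic tokens and then suppressing them with a penalty-based objective can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the threshold rule, and the exact form of the penalty (ordinary negative log-likelihood on unflagged tokens, plus a weighted log-likelihood term on flagged tokens) are assumptions made for illustration only. The per-token influence scores are taken as given inputs.

```python
import math


def select_toxic_tokens(influence_scores, threshold):
    """Hypothetical selection rule: flag tokens whose attribution
    score exceeds a fixed threshold (an assumption, not the paper's rule)."""
    return [score > threshold for score in influence_scores]


def penalized_lm_loss(token_logprobs, toxic_mask, penalty_weight=1.0):
    """Assumed penalty-style objective over one sequence.

    token_logprobs: log p(x_t | x_<t) for each target token.
    toxic_mask:     True where a token was flagged as toxic.

    Unflagged tokens contribute the usual negative log-likelihood.
    Flagged tokens contribute +penalty_weight * log p, so minimizing
    the loss lowers their likelihood instead of raising it.
    """
    if len(token_logprobs) != len(toxic_mask):
        raise ValueError("token_logprobs and toxic_mask must have equal length")
    if not token_logprobs:
        return 0.0
    total = 0.0
    for logprob, is_toxic in zip(token_logprobs, toxic_mask):
        if is_toxic:
            total += penalty_weight * logprob
        else:
            total += -logprob
    return total / len(token_logprobs)


# Usage with made-up numbers: the second token is flagged.
scores = [0.1, 2.3, -0.4]
mask = select_toxic_tokens(scores, threshold=1.0)
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5)]
loss = penalized_lm_loss(logprobs, mask, penalty_weight=1.0)
```

With no flagged tokens the sketch reduces to the standard mean negative log-likelihood. One caveat of this assumed form is that the penalty term is unbounded below as a flagged token's probability approaches zero, so a practical version would likely need a small penalty weight or clipping.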

Paper Structure

This paper contains 36 sections, 11 equations, 9 figures, 12 tables, 1 algorithm.

Figures (9)

  • Figure 1: Standard influence function results. We remove the most influential training examples and report toxicity and fluency after re-training Pythia-160M. Arrows indicate the preferred direction for each metric.
  • Figure 2: Fine-tuning toxicity reduction results. Toxicity and fluency on RTP for base models fine-tuned with IF-Guide for up to 800M tokens. Models are evaluated every $\sim$130M tokens (or $\sim$260M for Pythia-12B, due to compute constraints).
  • Figure 3: Impact of the proxy model. Each subplot corresponds to a model trained with IF-Guide. Bars show the toxicity and fluency when using different proxy models to select toxic tokens.
  • Figure 4: Layerwise toxicity results for Pythia-1B. For prompts where the base model predicts a toxic token, we report the average probability of toxic tokens across layers using Logit Lens.
  • Figure 5: Controlling the toxicity direction in Pythia-1B. The expected maximum toxicity (EMT) and toxicity probability (TP) on 1,000 prompts from RTP after adding a scaled toxicity direction to each model’s final-layer activations.
  • ...and 4 more figures