Removing Spurious Correlation from Neural Network Interpretations

Milad Fotouhi, Mohammad Taha Bahadori, Oluwaseyi Feyisetan, Payman Arabshahi, David Heckerman

TL;DR

This work tackles spurious correlations in neural network interpretation caused by topic confounding when attributing toxicity to internal units. It introduces a causal mediation framework with an entropy-balancing estimator to compute the natural indirect effect of internal units on harmful outputs, conditioning on topic, and demonstrates a single forward-pass approach that avoids retraining. Empirically, the method applied to two LLMs on RealToxicityPrompts shows that correcting for topic makes toxicity attribution more widespread across units, challenging the notion of localized toxicity. The approach offers a principled, generalizable way to debias interpretability analyses and can extend to other confounders beyond topic, enhancing safety and transparency in language models.

Abstract

Existing algorithms for identifying neurons responsible for undesired and harmful behaviors do not account for confounders such as the topic of the conversation. In this work, we show that confounders can create spurious correlations, and we propose a new causal mediation approach that controls for the impact of topic. In experiments with two large language models, we study the localization hypothesis and show that, after adjusting for the effect of conversation topic, toxicity becomes less localized.
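To make the idea of balancing on topic concrete, the following is a minimal sketch of entropy balancing, not the authors' implementation. It assumes topics are encoded as one-hot indicators and that the goal is to reweight a set of prompts so their topic shares match a chosen target mix. The function name `entropy_balance`, the toy data, and the target shares are all hypothetical.

```python
import math


def entropy_balance(features, target_means, steps=5000, lr=0.5, tol=1e-10):
    """Return weights (summing to 1) over the rows of `features` whose weighted
    feature means match `target_means`, staying as close as possible to
    uniform weights in KL divergence."""
    d = len(target_means)
    # Center on the target so the balance constraint reads sum_i w_i * c_i = 0.
    centered = [[row[j] - target_means[j] for j in range(d)] for row in features]
    lam = [0.0] * d
    weights = [1.0 / len(features)] * len(features)
    for _ in range(steps):
        # The solution has exponential-tilting form: w_i proportional to exp(-lam . c_i).
        scores = [-sum(l * c for l, c in zip(lam, row)) for row in centered]
        top = max(scores)  # subtract the max for numerical stability
        unnorm = [math.exp(s - top) for s in scores]
        total = sum(unnorm)
        weights = [u / total for u in unnorm]
        # Remaining imbalance; it is also the negative gradient of the convex dual.
        imbalance = [
            sum(w * row[j] for w, row in zip(weights, centered)) for j in range(d)
        ]
        if max(abs(g) for g in imbalance) < tol:
            break
        # Gradient-descent step on the dual objective.
        lam = [l + lr * g for l, g in zip(lam, imbalance)]
    return weights


# Hypothetical toy data: topic id and toxicity score for ten prompts.
topics = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
toxicity = [0.9, 0.8, 0.9, 0.7, 0.8, 0.9, 0.2, 0.1, 0.3, 0.1]
num_topics = 3
one_hot = [[1.0 if t == k else 0.0 for k in range(num_topics)] for t in topics]

# Assumed target topic mix (for example, the mix in a reference population).
target_shares = [0.2, 0.3, 0.5]
weights = entropy_balance(one_hot, target_shares)

naive = sum(toxicity) / len(toxicity)
adjusted = sum(w * y for w, y in zip(weights, toxicity))
print(f"naive mean toxicity:    {naive:.3f}")
print(f"topic-balanced mean:    {adjusted:.3f}")
```

In this toy setup the over-represented topic also carries the highest toxicity scores, so the unweighted mean overstates toxicity relative to the target topic mix. Reweighting removes that topic-driven gap, which is the kind of spurious association the abstract describes.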

Paper Structure

This paper contains 10 sections, 8 equations, 3 figures, 1 algorithm.

Figures (3)

  • Figure 1: We quantify the Natural Indirect Effect (NIE) in conversations using the above DAG. The conversation topic $\mathbf{x}$ influences both the question $\mathbf{q}$ and the harmfulness (e.g., bias or toxicity) of the LLM generations $\mathrm{y}$. Our goal is to apply this graph to every internal node and measure how much of the effect passes through a particular internal node $\mathbf{n}$ (e.g., activations of internal neurons). Previous studies do not consider the impact of the conversation topic $\mathbf{x}$ (marked in red).
  • Figure 2: Contributions of different MLPs to the generation of toxic outputs. We measure the average indirect effect mediated by each MLP.
  • Figure 3: The t-SNE embedding of 1000 randomly selected queries and the clusters identified by k-means in the RealToxicityPrompts dataset (Gehman et al., 2020).
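
For reference, the natural indirect effect named in Figure 1 can be written in standard mediation notation. The display below is a sketch of the textbook definition adapted to the figure's symbols, and it is not necessarily the paper's exact estimand. The reference question $\mathbf{q}'$ and the potential-outcome notation $\mathrm{y}(\cdot, \mathbf{n}(\cdot))$ are assumptions introduced here for illustration.

$$
\mathrm{NIE} = \mathbb{E}_{\mathbf{x}}\Big[\, \mathbb{E}\big[\mathrm{y}(\mathbf{q}', \mathbf{n}(\mathbf{q})) \mid \mathbf{x}\big] - \mathbb{E}\big[\mathrm{y}(\mathbf{q}', \mathbf{n}(\mathbf{q}')) \mid \mathbf{x}\big] \Big]
$$

Here the question fed to the rest of the model is held at $\mathbf{q}'$ while only the internal node $\mathbf{n}$ is switched to the value it would take under $\mathbf{q}$. The inner expectations are taken within a topic $\mathbf{x}$ and the outer expectation averages over topics. Dropping the conditioning on $\mathbf{x}$ gives the unadjusted version that ignores the topic.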