SAFR: Neuron Redistribution for Interpretability

Ruidi Chang, Chunyuan Deng, Hanjie Chen

TL;DR

SAFR tackles interpretability in neural networks by explicitly redistributing neuron usage through two regularizations: monosemantic emphasis for important tokens via a VMASK-based mechanism and polysemantic encouragement for correlated token pairs using attention weights. The joint loss $\mathcal{L}= \mathcal{L}_{\mathrm{CE}} + \lambda_{\mathrm{Imp}} \mathcal{L}_{\mathrm{Imp}} + \lambda_{\mathrm{Inter}} \mathcal{L}_{\mathrm{Inter}}$ guides the model to separate salient features across neurons while preserving interactions. Experiments on SST-2 and IMDB show improved interpretability, quantified by the Superposition Regularization Score (SRS), with minimal impact on accuracy and with clear visualizations of neuron allocation in FFN layers. This work advances mechanistic interpretability in NLP by making neuron utilization more interpretable and amenable to visualization, and it opens avenues for scaling SAFR to larger architectures and broader tasks.
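The weighted sum in the joint loss can be illustrated with a minimal sketch. It shows only how the three terms combine. The function names, the default weights, and the scalar penalty values are hypothetical placeholders: the definitions of $\mathcal{L}_{\mathrm{Imp}}$ and $\mathcal{L}_{\mathrm{Inter}}$ are not given in this summary, so they are passed in as already-computed numbers.

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of the true label (the L_CE term).

    `probs` is a list of predicted class probabilities that sums to 1.
    """
    return -math.log(probs[label])


def joint_loss(ce, imp_penalty, inter_penalty, lambda_imp=0.1, lambda_inter=0.1):
    """Combine the terms as L = L_CE + lambda_Imp * L_Imp + lambda_Inter * L_Inter.

    `imp_penalty` and `inter_penalty` stand in for the two regularizers.
    Their real definitions are not reproduced here (assumption), and the
    default weights are illustrative, not values from the paper.
    """
    return ce + lambda_imp * imp_penalty + lambda_inter * inter_penalty


# Hypothetical numbers for a single binary-classification example.
ce = cross_entropy([0.2, 0.8], label=1)
total = joint_loss(ce, imp_penalty=0.5, inter_penalty=0.3)
print(f"L_CE = {ce:.4f}, joint loss = {total:.4f}")
```

Setting both weights to zero recovers plain cross-entropy training, which makes the regularizers easy to ablate one at a time.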

Abstract

Superposition refers to encoding representations of multiple features within a single neuron, which is common in deep neural networks. This property allows neurons to combine and represent multiple features, enabling the model to capture intricate information and handle complex tasks. This performance, however, comes at the cost of interpretability. This paper presents a novel approach to enhance model interpretability by regularizing feature superposition. We introduce SAFR, which adds regularization terms to the loss function to promote monosemantic representations for important tokens while encouraging polysemanticity for correlated token pairs; important tokens are identified via VMASK and correlated token pairs via attention weights. We evaluate SAFR with a transformer model on two classification tasks. Experiments demonstrate that SAFR improves model interpretability without compromising prediction performance. In addition, SAFR provides explanations by visualizing neuron allocation within the intermediate layers.

Paper Structure

This paper contains 28 sections, 6 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Basic structure of SAFR. i) Promote monosemanticity for important tokens after the embedding layers. ii) Leverage the attention mechanism to enhance polysemanticity among correlated token pairs.
  • Figure 2: (a) Important tokens exhibit higher capacity. (b) Circle size represents capacity, with larger circles indicating greater capacity. Red lines denote positive correlations, blue lines indicate negative correlations, and shorter lines indicate stronger correlations. (c) Important tokens demonstrate lower polysemanticity, while correlated token pairs exhibit relatively higher interference.
  • Figure 3: Sensitivity to $k$ Selection. As tokens are gradually removed, accuracy declines consistently.
  • Figure 4: Cross-Layer Output: Capacity. The VMASK layer uses the importance scores it detects, while the attention layer uses normalized attention scores. The original sentence is "Preposterous and tedious, Sonny is spiked with unintentional laughter that, unfortunately, occurs too infrequently to make the film even a guilty pleasure." (negative)
  • Figure 5: Cross-Layer Output: Interference. The attention layer uses the attention weight matrix. The original sentence is "Preposterous and tedious, Sonny is spiked with unintentional laughter that, unfortunately, occurs too infrequently to make the film even a guilty pleasure." (negative)