Uncovering Model Processing Strategies with Non-Negative Per-Example Fisher Factorization

Michael Matena, Colin Raffel

TL;DR

NPEFF is a scalable framework for uncovering model processing strategies. It decomposes per-example Fisher information into non-negative combinations of rank-1 PSD components, each specified by a pseudo-Fisher vector. By representing per-example perturbations with low-rank matrices, applying randomized projections and rank reduction, and solving a non-negative decomposition, the method reveals interpretable processing strategies and enables perturbations that selectively disrupt them. Across SNLI, SST2, and TriviaQA in zero-shot settings, NPEFF components align with predictions or labels and remain robust on held-out data, and the perturbations effectively target component-specific behavior. The work shows practical utility in analyzing unlearning and in-context learning, outperforms gradient clustering and activation SAEs as a basis for mechanistic insight, and provides a release-ready toolkit for researchers exploring model processing strategies.

Abstract

We introduce NPEFF (Non-Negative Per-Example Fisher Factorization), an interpretability method that aims to uncover strategies used by a model to generate its predictions. NPEFF decomposes per-example Fisher matrices using a novel decomposition algorithm that learns a set of components, each represented by a rank-1 positive semi-definite matrix. Through a combination of human evaluation and automated analysis, we demonstrate that these NPEFF components correspond to model processing strategies for a variety of language models and text processing tasks. We further show how to construct parameter perturbations from NPEFF components to selectively disrupt a given component's role in the model's processing. Along with conducting extensive ablation studies, we include experiments that use NPEFF to analyze and mitigate collateral effects of unlearning and to study in-context learning. Furthermore, we demonstrate the advantages of NPEFF over baselines such as gradient clustering and using sparse autoencoders for dictionary learning over model activations.
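
To make the objects in the abstract concrete, the following is a minimal sketch of a per-example Fisher matrix and of a non-negative combination of rank-1 PSD matrices. It is an illustration under stated assumptions, not the authors' decomposition algorithm: the model is a toy softmax regression, and the names `per_example_fisher` and `reconstruct` are hypothetical. The Fisher definition used here is the standard one, the expectation over the model's own predictive distribution of the outer product of the log-likelihood gradient with itself.

```python
import numpy as np

def per_example_fisher(W, x):
    """Fisher matrix of a toy softmax-regression model at one input x.

    W has shape (num_classes, dim). The result is a square matrix over
    the flattened parameters of W. (Toy model, assumed for illustration.)
    """
    logits = W @ x
    p = np.exp(logits - logits.max())  # numerically stable softmax
    p /= p.sum()
    num_classes = W.shape[0]
    fisher = np.zeros((W.size, W.size))
    for y in range(num_classes):
        # Gradient of log p(y | x) w.r.t. W is (one_hot(y) - p) x^T.
        g = np.outer(np.eye(num_classes)[y] - p, x).ravel()
        # Weight each outer product by the model's probability of y.
        fisher += p[y] * np.outer(g, g)
    return fisher

def reconstruct(coeffs, vectors):
    """Non-negative combination of rank-1 PSD matrices v_j v_j^T."""
    coeffs = np.asarray(coeffs, dtype=float)
    if np.any(coeffs < 0):
        raise ValueError("coefficients must be non-negative")
    return sum(c * np.outer(v, v) for c, v in zip(coeffs, vectors))

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
x = rng.normal(size=4)

F = per_example_fisher(W, x)
eigvals = np.linalg.eigvalsh(F)   # eigenvalues of a symmetric matrix
rank = np.linalg.matrix_rank(F)   # low rank: at most num_classes - 1 here

# Two arbitrary vectors standing in for learned components (hypothetical).
vectors = rng.normal(size=(2, W.size))
approx = reconstruct([0.5, 2.0], vectors)
approx_eigvals = np.linalg.eigvalsh(approx)
```

In this toy setting the per-example Fisher is symmetric, positive semi-definite, and of rank at most `num_classes - 1`. Any non-negative combination of matrices of the form `v v^T` is also positive semi-definite, which is why such a combination is a natural approximating family for Fisher matrices.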

Paper Structure (51 sections, 8 equations, 5 figures, 12 tables)

Figures (5)

  • Figure 1: Top examples of selected components from NPEFF decompositions in \ref{sec:comp_tunings}.
  • Figure 2: Log-scale plot of the mean decay of normalized eigenvalues of PEF matrices with 64 expectation random projections.
  • Figure 3: Top examples of random components from NPEFF decompositions for SNLI in \ref{sec:comp_tunings}. [P] denotes the premise and [H] denotes the hypothesis.
  • Figure 4: Top examples of random components from NPEFF decompositions for SST2 in \ref{sec:comp_tunings}.
  • Figure 5: Top examples of random components from NPEFF decompositions for TriviaQA in \ref{sec:comp_tunings}.