SPEX: Scaling Feature Interaction Explanations for LLMs

Justin Singh Kang, Landon Butler, Abhineet Agarwal, Yigit Efe Erginbas, Ramtin Pedarsani, Kannan Ramchandran, Bin Yu

TL;DR

SPEX is a scalable, model-agnostic method for explaining feature interactions in large language models. It exploits sparsity in the interaction structure and pairs a sparse Fourier transform with BCH-code-based channel decoding. The method collects a small set of masked-input evaluations, learns a surrogate function via iterative message passing, and recovers a compact set of high-order interactions that faithfully reconstructs model outputs (improving faithfulness by up to 20% over marginal attributions on long-context tasks). Experiments on Sentiment, DROP, and HotpotQA show that SPEX identifies influential features and interactions, and on HotpotQA these interactions align with human annotations. The same approach is used to examine reasoning in closed-source LLMs and multi-modal models. The approach relies on underlying sparsity and involves nontrivial sample costs; proposed future work includes adaptive masking, integration with internal model structure, and connections to sparse attention in transformers.

Abstract

Large language models (LLMs) have revolutionized machine learning due to their ability to capture complex interactions between input features. Popular post-hoc explanation methods like SHAP provide marginal feature attributions, while their extensions to interaction importances only scale to small input lengths ($\approx 20$). We propose Spectral Explainer (SPEX), a model-agnostic interaction attribution algorithm that efficiently scales to large input lengths ($\approx 1000$). SPEX exploits underlying natural sparsity among interactions -- common in real-world data -- and applies a sparse Fourier transform using a channel decoding algorithm to efficiently identify important interactions. We perform experiments across three difficult long-context datasets that require LLMs to utilize interactions between inputs to complete the task. For large inputs, SPEX outperforms marginal attribution methods by up to 20% in terms of faithfully reconstructing LLM outputs. Further, SPEX successfully identifies key features and interactions that strongly influence model output. For one of our datasets, HotpotQA, SPEX provides interactions that align with human annotations. Finally, we use our model-agnostic approach to generate explanations to demonstrate abstract reasoning in closed-source LLMs (GPT-4o mini) and compositional reasoning in vision-language models.
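
To make the idea of sparse interactions concrete, here is a minimal sketch that computes the Boolean Fourier transform of a masked-input function by brute force. It is not the SPEX algorithm: it enumerates all $2^n$ masks, which is exactly what a sparse transform is meant to avoid. The function `toy_model`, its weights, and the four-token setup are hypothetical stand-ins for an LLM score.

```python
import itertools
import numpy as np

def toy_model(mask):
    """Hypothetical stand-in for an LLM score on a 4-token input.
    mask[i] == 1 keeps token i, mask[i] == 0 masks it out.
    Only tokens 0 and 1 matter, and they interact."""
    a, b = mask[0], mask[1]
    return -0.5 * a - 1.0 * b + 3.0 * a * b

def fourier_coefficients(f, n):
    """Brute-force transform F(k) = 2^-n * sum_m (-1)^<k,m> f(m).
    Cost grows as 4^n, so this is only feasible for tiny n."""
    masks = np.array(list(itertools.product([0, 1], repeat=n)))
    values = np.array([f(m) for m in masks], dtype=float)
    coeffs = {}
    for k in masks:
        signs = (-1.0) ** (masks @ k)  # character (-1)^<k,m> for every mask m
        coeffs[tuple(int(bit) for bit in k)] = float(np.mean(signs * values))
    return coeffs

coeffs = fourier_coefficients(toy_model, n=4)
# Keep only the non-negligible coefficients: the "sparse" support.
support = {k: v for k, v in coeffs.items() if abs(v) > 1e-9}
for k, v in sorted(support.items()):
    print(k, round(v, 3))
```

In this toy case only 3 of the 16 coefficients are nonzero, and one of them sits on the pair of tokens 0 and 1. A first-order attribution method has no term for that pair.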

Paper Structure

This paper contains 63 sections, 1 theorem, 40 equations, 10 figures, 3 tables, 4 algorithms.

Key Result

Lemma 1.1

If $\left\lvert \mathbf n \right\rvert + \left\lvert \mathbf k^* \right\rvert \leq t$, where $\mathbf n$ is the additive noise in $\mathbb F_2$ induced by the noisy process in eq:ratio and the estimation procedure in Algorithm alg:bch-hard, then we can recover $\mathbf k^*$.
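
As an illustrative reading of this condition, assume that $\lvert \cdot \rvert$ denotes Hamming weight and that $t$ is the number of bit errors the BCH code can correct. Neither is defined in this excerpt, and the numbers below are hypothetical:

```latex
\[
t = 5,\quad \lvert \mathbf k^* \rvert = 2
\;\Longrightarrow\;
\lvert \mathbf n \rvert \;\le\; t - \lvert \mathbf k^* \rvert \;=\; 3 .
\]
```

Under these assumptions, a pairwise interaction stays recoverable as long as the noise flips at most three bits. For a fixed $t$, higher-order interactions leave less room for noise.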

Figures (10)

  • Figure 1: (a) Sentiment analysis: SPEX identifies the double negative "never fails". Marginal approaches assign positive attributions to "never" and "fails". (b) Retrieval augmented generation: SPEX explains the output of a RAG pipeline, finding a combination of documents the LLM used to answer the question and ignoring unimportant information. (c) Visual question answering: SPEX identifies interaction between image patches required to correctly summarize the image.
  • Figure 2: Marginal attribution approaches scale to large $n$, but do not capture interactions. Interaction indices only work for small $n$. SPEX computes interactions and scales.
  • Figure 3: SPEX utilizes channel codes to determine masking patterns. We observe the changes in model output depending on the used mask. SPEX uses message passing to learn a surrogate function to generate interaction-based explanations.
  • Figure 4: (a) SPEX uniformly outperforms all baselines in terms of faithfulness. High order Faith-Banzhaf indices have competitive faithfulness, but rapidly increase in computational cost. (b) The DROP dataset contains only larger examples, so we primarily compare against first order methods. (c) Our approach remains competitive in this task as well, and still outperforms marginal approaches for large $n$.
  • Figure 5: Depiction of the message passing algorithm for computing the surrogate function in SPEX.
  • ...and 5 more figures
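
The faithfulness comparison between marginal and interaction-based explanations can be illustrated with a small sketch. It assumes faithfulness is measured as an $R^2$ score between surrogate and model outputs, which this excerpt does not define. The `toy_model`, the least-squares surrogates, and the in-sample evaluation are simplifications for illustration, not the paper's experimental setup.

```python
import itertools
import numpy as np

def toy_model(mask):
    """Hypothetical stand-in for an LLM score on a 3-token input;
    tokens 0 and 1 interact, token 2 is irrelevant."""
    a, b = mask[0], mask[1]
    return -0.5 * a - 1.0 * b + 3.0 * a * b

def design_matrix(masks, order):
    """Columns: intercept plus products of every feature subset up to `order`."""
    n = masks.shape[1]
    columns = [np.ones(len(masks))]
    for size in range(1, order + 1):
        for subset in itertools.combinations(range(n), size):
            columns.append(np.prod(masks[:, list(subset)], axis=1))
    return np.column_stack(columns)

def faithfulness(masks, values, order):
    """R^2 of a least-squares surrogate of the given interaction order
    (assumed metric; computed in-sample for simplicity)."""
    X = design_matrix(masks, order)
    weights, *_ = np.linalg.lstsq(X, values, rcond=None)
    residual = values - X @ weights
    total = values - values.mean()
    return 1.0 - np.sum(residual**2) / np.sum(total**2)

masks = np.array(list(itertools.product([0, 1], repeat=3)), dtype=float)
values = np.array([toy_model(m) for m in masks])

r2_marginal = faithfulness(masks, values, order=1)     # first-order surrogate
r2_interaction = faithfulness(masks, values, order=2)  # adds pairwise terms
print(f"first-order R^2:  {r2_marginal:.3f}")
print(f"second-order R^2: {r2_interaction:.3f}")
```

Because the toy output depends on a pairwise product, the first-order surrogate cannot reach a perfect score, while the second-order surrogate fits exactly. The number of columns in `design_matrix` grows combinatorially with `order` and input length, which is the scaling obstacle the captions for Figures 2 and 4 refer to.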