SPEX: Scaling Feature Interaction Explanations for LLMs
Justin Singh Kang, Landon Butler, Abhineet Agarwal, Yigit Efe Erginbas, Ramtin Pedarsani, Kannan Ramchandran, Bin Yu
TL;DR
SPEX is a scalable, model-agnostic method for explaining feature interactions in large language models. It exploits sparsity in the interaction structure and uses a sparse Fourier transform paired with BCH-code-based channel decoding. The method collects a small set of masked-input evaluations, learns a surrogate function via iterative message passing, and recovers a compact set of high-order interactions that faithfully reconstructs model outputs, improving faithfulness by up to about 20% over marginal attribution methods on long-context tasks. Evaluations on Sentiment, DROP, and HotpotQA show that SPEX identifies interactions that align with human annotations and can be used to debug reasoning in closed-source LLMs and multi-modal models. The approach relies on the interactions being sparse and has a nontrivial sample cost. Directions for future work include adaptive masking, integration with internal model structure, and connections to sparse attention in transformers.
Abstract
Large language models (LLMs) have revolutionized machine learning due to their ability to capture complex interactions between input features. Popular post-hoc explanation methods like SHAP provide marginal feature attributions, while their extensions to interaction importances only scale to small input lengths ($\approx 20$). We propose Spectral Explainer (SPEX), a model-agnostic interaction attribution algorithm that efficiently scales to large input lengths ($\approx 1000$). SPEX exploits underlying natural sparsity among interactions -- common in real-world data -- and applies a sparse Fourier transform using a channel decoding algorithm to efficiently identify important interactions. We perform experiments across three difficult long-context datasets that require LLMs to utilize interactions between inputs to complete the task. For large inputs, SPEX outperforms marginal attribution methods by up to 20% in terms of faithfully reconstructing LLM outputs. Further, SPEX successfully identifies key features and interactions that strongly influence model output. For one of our datasets, HotpotQA, SPEX provides interactions that align with human annotations. Finally, we use our model-agnostic approach to generate explanations to demonstrate abstract reasoning in closed-source LLMs (GPT-4o mini) and compositional reasoning in vision-language models.
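
To make the idea of sparse interactions in the Fourier domain concrete, the following is a minimal sketch and not the SPEX algorithm itself. It assumes the standard Boolean Fourier (Walsh-Hadamard) basis over masking patterns, and it enumerates all $2^n$ masks by brute force, which is exactly the cost a sparse transform is meant to avoid. The names `toy_model` and `fourier_coefficients` are hypothetical, and the toy function stands in for a model output under input masking.

```python
import itertools
import numpy as np

def toy_model(mask):
    # Hypothetical stand-in for a model's output when only the features
    # with mask[i] == 1 are kept. Feature 0 acts alone; features 1 and 2
    # matter only jointly; feature 3 is irrelevant.
    return 2.0 * mask[0] + 3.0 * mask[1] * mask[2]

def fourier_coefficients(f, n):
    # Brute-force Boolean Fourier transform over all 2^n masks:
    #   F(k) = 2^{-n} * sum_m (-1)^{<k, m>} f(m)
    # Returns only the nonzero coefficients, keyed by the feature subset
    # (the indices where k is 1).
    masks = np.array(list(itertools.product([0, 1], repeat=n)))
    values = np.array([f(m) for m in masks], dtype=float)
    coeffs = {}
    for k in masks:
        signs = (-1.0) ** (masks @ k)  # parity character for subset k
        c = float(np.mean(signs * values))
        if abs(c) > 1e-12:
            coeffs[tuple(i for i in range(n) if k[i])] = c
    return coeffs

coeffs = fourier_coefficients(toy_model, 4)
for subset, value in sorted(coeffs.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(subset, value)
```

In this toy case only 5 of the 16 coefficients are nonzero, and the subset `(1, 2)` carries the pairwise interaction. With input lengths near 1000, full enumeration is infeasible, which is why the abstract's approach relies on sparsity and a small number of masked evaluations.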
