ProxySPEX: Inference-Efficient Interpretability via Sparse Feature Interactions in LLMs
Landon Butler, Abhineet Agarwal, Justin Singh Kang, Yigit Efe Erginbas, Bin Yu, Kannan Ramchandran
TL;DR
ProxySPEX infers interpretable, high-order feature interactions in large language models by exploiting hierarchical structure in the Fourier spectrum of the value function. It trains a gradient boosted tree proxy on masked inputs to capture interactions efficiently, then extracts a sparse, interpretable spectrum ($\approx 200$ coefficients) that approximates Shapley-style attributions with far fewer model inferences than prior approaches such as SPEX. Across four high-dimensional datasets, ProxySPEX achieves substantially higher faithfulness than marginal methods and comparable or better fidelity than SPEX with ~10× fewer inferences, delivering practical speedups (3–5×) for sentiment and image-caption tasks. The work also demonstrates data attribution and mechanistic interpretability applications, uncovering synergistic versus redundant data interactions and intra- and inter-head dynamics in large models. The authors acknowledge limitations for non-hierarchical interactions and call for future extensions in proxy models and spectral priors.
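To make the notion of a sparse, hierarchical Fourier spectrum concrete, here is a minimal sketch on a toy value function. The function `toy_value`, its coefficients, and the helper names are assumptions chosen for illustration and are not taken from the paper. The sign convention $(-1)^{|S \cap m|}$ is one common choice, and the strict subset check in `is_hierarchical` is a simplified reading of "hierarchical". Brute-force enumeration over all $2^n$ masks is only feasible for tiny $n$.

```python
import itertools

N = 4  # number of maskable features (toy size; brute force needs 2**N evaluations)

def toy_value(mask):
    """Hypothetical value function of a 0/1 mask (stand-in for a masked model output)."""
    x0, x1, x2 = mask[0], mask[1], mask[2]
    return 1.0 * x0 + 0.5 * x1 + 2.0 * x0 * x1 - 1.0 * x2

def fourier_spectrum(f, n):
    """Brute-force Boolean Fourier transform: F(S) = 2^-n * sum_m f(m) * (-1)^{|S ∩ m|}."""
    masks = list(itertools.product((0, 1), repeat=n))
    values = [f(m) for m in masks]
    spectrum = {}
    for s in masks:  # each 0/1 tuple s also encodes a subset S of features
        total = sum(v * (-1) ** sum(si & mi for si, mi in zip(s, m))
                    for m, v in zip(masks, values))
        support = frozenset(i for i, si in enumerate(s) if si)
        spectrum[support] = total / len(masks)
    return spectrum

def is_hierarchical(supports):
    """True if every nonempty proper subset of each support is also present."""
    supports = set(supports)
    for s in supports:
        for k in range(1, len(s)):
            for sub in itertools.combinations(sorted(s), k):
                if frozenset(sub) not in supports:
                    return False
    return True

spectrum = fourier_spectrum(toy_value, N)
nonzero = {s: c for s, c in spectrum.items() if abs(c) > 1e-12}
for s, c in sorted(nonzero.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(s), c)
print("hierarchical:", is_hierarchical(nonzero))
```

For this toy function only 5 of the 16 coefficients are nonzero, and the pairwise interaction between features 0 and 1 appears together with both of its single-feature subsets, which is the kind of structure the summary above refers to.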
Abstract
Large Language Models (LLMs) have achieved remarkable performance by capturing complex interactions between input features. To identify these interactions, most existing approaches require enumerating all possible combinations of features up to a given order, causing them to scale poorly with the number of inputs $n$. Recently, Kang et al. (2025) proposed SPEX, an information-theoretic approach that uses interaction sparsity to scale to $n \approx 10^3$ features. SPEX greatly improves upon prior methods but requires tens of thousands of model inferences, which can be prohibitive for large models. In this paper, we observe that LLM feature interactions are often hierarchical (higher-order interactions are accompanied by their lower-order subsets), which enables more efficient discovery. To exploit this hierarchy, we propose ProxySPEX, an interaction attribution algorithm that first fits gradient boosted trees to masked LLM outputs and then extracts the important interactions. Experiments across four challenging high-dimensional datasets show that ProxySPEX reconstructs LLM outputs 20% more faithfully than marginal attribution approaches while using $10\times$ fewer inferences than SPEX. By accounting for interactions, ProxySPEX efficiently identifies the most influential features, providing a scalable approximation of their Shapley values. Further, we apply ProxySPEX to two interpretability tasks: data attribution, where we identify interactions among CIFAR-10 training samples that influence test predictions, and mechanistic interpretability, where we uncover interactions between attention heads, both within and across layers, on a question-answering task.
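The first step of the recipe described in the abstract, fitting gradient boosted trees to masked outputs, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_model` is a hypothetical stand-in for one masked LLM inference, the small hand-rolled booster stands in for a library gradient boosting implementation, and the hyperparameters (`n_trees`, `depth`, `lr`) and the sample budget of 64 masks are arbitrary. The interaction-extraction step is not shown.

```python
import itertools
import random

N = 8  # number of maskable features (toy size)

def toy_model(mask):
    """Hypothetical stand-in for one masked LLM inference (not from the paper)."""
    return 1.0 * mask[0] + 0.5 * mask[1] + 2.0 * mask[0] * mask[1] - 1.0 * mask[2]

def fit_tree(X, r, depth):
    """Greedy regression tree on 0/1 features; a leaf is a float, a node is (j, lo, hi)."""
    mean = sum(r) / len(r)
    if depth == 0:
        return mean
    best = None
    for j in range(len(X[0])):
        lo = [i for i, x in enumerate(X) if x[j] == 0]
        hi = [i for i, x in enumerate(X) if x[j] == 1]
        if not lo or not hi:
            continue  # feature j is constant in this node, so it cannot split it
        m_lo = sum(r[i] for i in lo) / len(lo)
        m_hi = sum(r[i] for i in hi) / len(hi)
        sse = sum((r[i] - m_lo) ** 2 for i in lo) + sum((r[i] - m_hi) ** 2 for i in hi)
        if best is None or sse < best[0]:
            best = (sse, j, lo, hi)
    if best is None:
        return mean
    _, j, lo, hi = best
    return (j,
            fit_tree([X[i] for i in lo], [r[i] for i in lo], depth - 1),
            fit_tree([X[i] for i in hi], [r[i] for i in hi], depth - 1))

def predict_tree(tree, x):
    while isinstance(tree, tuple):
        j, lo, hi = tree
        tree = hi if x[j] else lo
    return tree

def fit_boosted_trees(X, y, n_trees=100, depth=2, lr=0.3):
    """Squared-error gradient boosting: each tree is fit to the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    trees = []
    for _ in range(n_trees):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        tree = fit_tree(X, resid, depth)
        trees.append(tree)
        pred = [p + lr * predict_tree(tree, x) for p, x in zip(pred, X)]
    return lambda x: base + lr * sum(predict_tree(t, x) for t in trees)

def mse(model, X, y):
    return sum((model(x) - yi) ** 2 for x, yi in zip(X, y)) / len(y)

rng = random.Random(0)
all_masks = list(itertools.product((0, 1), repeat=N))
rng.shuffle(all_masks)
train, held_out = all_masks[:64], all_masks[64:]  # 64 "inferences" out of 256 masks
y_train = [toy_model(m) for m in train]
y_held = [toy_model(m) for m in held_out]

proxy = fit_boosted_trees(train, y_train)
mean_y = sum(y_train) / len(y_train)
baseline = lambda x: mean_y  # constant predictor, for reference

print("train MSE    baseline:", mse(baseline, train, y_train), "proxy:", mse(proxy, train, y_train))
print("held-out MSE baseline:", mse(baseline, held_out, y_held), "proxy:", mse(proxy, held_out, y_held))
```

Depth-2 trees are used here because a single root-to-leaf path can condition on two features at once, which lets the proxy represent pairwise interactions such as the one built into `toy_model`. Comparing the held-out error of the proxy with the constant baseline gives a rough sense of how faithfully the proxy tracks the model on masks it never queried.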
