Towards Unified Attribution in Explainable AI, Data-Centric AI, and Mechanistic Interpretability
Shichang Zhang, Tessa Han, Usha Bhalla, Himabindu Lakkaraju
TL;DR
The paper addresses the fragmentation of attribution methods by proposing a unified framework that covers feature attribution (FA), data attribution (DA), and component attribution (CA). It shows that the three share core techniques (perturbations, gradients, and linear approximations), applied respectively to the input features $\mathbf{x}$ of a model $f(\mathbf{x})$, the training set $\mathcal{D}_{\text{train}}$, and internal model components $c_k$, and that this common ground enables cross-attribution innovation and practical applications such as model editing and steering. Through a formal taxonomy and a synthesis of representative methods (e.g., Shapley-based, path-tracking, and causal mediation analyses), the work clarifies connections, evaluation criteria, and shared challenges such as efficiency and stability. The unified view is intended to advance interpretability research and AI more broadly by offering a single lens for combining feature-, data-, and component-level insights in support of more robust, controllable AI systems.
Abstract
The increasing complexity of AI systems has made understanding their behavior critical. Numerous interpretability methods have been developed to attribute model behavior to three key aspects: input features, training data, and internal model components, which emerged from explainable AI, data-centric AI, and mechanistic interpretability, respectively. However, these attribution methods are studied and applied rather independently, resulting in a fragmented landscape of methods and terminology. This position paper argues that feature, data, and component attribution methods share fundamental similarities, and a unified view of them benefits both interpretability and broader AI research. To this end, we first analyze popular methods for these three types of attributions and present a unified view demonstrating that these seemingly distinct methods employ similar techniques (such as perturbations, gradients, and linear approximations) over different aspects and thus differ primarily in their perspectives rather than techniques. Then, we demonstrate how this unified view enhances understanding of existing attribution methods, highlights shared concepts and evaluation criteria among these methods, and leads to new research directions both in interpretability research, by addressing common challenges and facilitating cross-attribution innovation, and in AI more broadly, with applications in model editing, steering, and regulation.
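To make the claim that the three attribution types differ in perspective rather than technique more concrete, the following is a minimal sketch of one shared technique, leave-one-out perturbation, applied in turn to input features, training points, and hidden units. The toy model (fixed random ReLU features with ridge-fitted output weights), the function names, and the synthetic data are all assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def fit(X, y, W1, lam=1e-2):
    """Ridge-fit the output weights of a toy net whose hidden weights W1 are fixed."""
    H = np.maximum(X @ W1, 0.0)                      # hidden activations (ReLU)
    A = H.T @ H + lam * np.eye(H.shape[1])
    return np.linalg.solve(A, H.T @ y)

def predict(x, W1, w2, mask=None):
    """Predict for one input; `mask` optionally zeroes out hidden units."""
    h = np.maximum(x @ W1, 0.0)
    if mask is not None:
        h = h * mask
    return float(h @ w2)

def feature_attribution(x, W1, w2):
    """Perturb the input: zero one feature at a time."""
    base = predict(x, W1, w2)
    scores = []
    for i in range(len(x)):
        x_pert = x.copy()
        x_pert[i] = 0.0
        scores.append(base - predict(x_pert, W1, w2))
    return np.array(scores)

def data_attribution(x, X, y, W1):
    """Perturb the training set: drop one point at a time and refit."""
    base = predict(x, W1, fit(X, y, W1))
    scores = []
    for j in range(len(X)):
        keep = np.arange(len(X)) != j
        scores.append(base - predict(x, W1, fit(X[keep], y[keep], W1)))
    return np.array(scores)

def component_attribution(x, W1, w2):
    """Perturb the model: ablate one hidden unit at a time."""
    base = predict(x, W1, w2)
    n_hidden = W1.shape[1]
    scores = []
    for k in range(n_hidden):
        mask = np.ones(n_hidden)
        mask[k] = 0.0
        scores.append(base - predict(x, W1, w2, mask))
    return np.array(scores)

# Synthetic data, purely illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)
W1 = rng.normal(size=(3, 4))
w2 = fit(X, y, W1)
x = rng.normal(size=3)

fa = feature_attribution(x, W1, w2)      # one score per input feature
da = data_attribution(x, X, y, W1)       # one score per training point
ca = component_attribution(x, W1, w2)    # one score per hidden unit
```

All three functions follow the same pattern: compute a baseline prediction, remove one element, and record the change. Only the object being perturbed differs. In this toy model the output is linear in the hidden units, so the component scores sum to the prediction itself; no such property should be expected for the feature or data scores.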
