Towards Unified Attribution in Explainable AI, Data-Centric AI, and Mechanistic Interpretability

Shichang Zhang, Tessa Han, Usha Bhalla, Himabindu Lakkaraju

TL;DR

The paper tackles the fragmentation of attribution methods by proposing a unified framework that encompasses feature attribution (FA), data attribution (DA), and component attribution (CA). It shows that FA, DA, and CA share core techniques (perturbations, gradients, and linear approximations), applied respectively to the inputs of $f(\mathbf{x})$, the training set $\mathcal{D}_{\text{train}}$, and the model components $c_k$, and that this common ground enables cross-attribution innovation and practical AI applications such as model editing and steering. Through a formal taxonomy and a synthesis of representative methods (e.g., Shapley-based, path-tracking, and causal mediation analyses), the work clarifies connections among the methods, their evaluation criteria, and shared challenges such as efficiency and stability. The unified view aims to advance interpretability research and AI more broadly by offering a coherent lens for integrating feature-, data-, and component-level insights into more robust, controllable AI systems.

Abstract

The increasing complexity of AI systems has made understanding their behavior critical. Numerous interpretability methods have been developed to attribute model behavior to three key aspects: input features, training data, and internal model components, which emerged from explainable AI, data-centric AI, and mechanistic interpretability, respectively. However, these attribution methods are studied and applied rather independently, resulting in a fragmented landscape of methods and terminology. This position paper argues that feature, data, and component attribution methods share fundamental similarities, and a unified view of them benefits both interpretability and broader AI research. To this end, we first analyze popular methods for these three types of attributions and present a unified view demonstrating that these seemingly distinct methods employ similar techniques (such as perturbations, gradients, and linear approximations) over different aspects and thus differ primarily in their perspectives rather than techniques. Then, we demonstrate how this unified view enhances understanding of existing attribution methods, highlights shared concepts and evaluation criteria among these methods, and leads to new research directions both in interpretability research, by addressing common challenges and facilitating cross-attribution innovation, and in AI more broadly, with applications in model editing, steering, and regulation.
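To make the claim about shared techniques concrete, the sketch below applies one of them, perturbation, to all three targets on a toy ridge-regression model. It is an illustration under stated assumptions, not code from the paper: the helper names (`fit`, `predict`, `feature_attribution`, `data_attribution`, `component_attribution`), the zero baseline, and the choice to treat each weight as a "component" are all hypothetical.

```python
import numpy as np

def fit(X, y, ridge=1e-3):
    """Ridge-regression weights; a stand-in for 'training a model' (assumption)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ y)

def predict(w, x):
    """Toy model output f(x) = w . x."""
    return float(w @ x)

def feature_attribution(w, x, baseline=0.0):
    """FA by perturbation: set one input feature to a baseline, record the output change."""
    full = predict(w, x)
    scores = np.zeros(len(x))
    for i in range(len(x)):
        x_pert = x.copy()
        x_pert[i] = baseline
        scores[i] = full - predict(w, x_pert)
    return scores

def data_attribution(X, y, x_test):
    """DA by perturbation: drop one training point, retrain, record the prediction change."""
    full = predict(fit(X, y), x_test)
    scores = np.zeros(len(X))
    for j in range(len(X)):
        keep = np.arange(len(X)) != j
        scores[j] = full - predict(fit(X[keep], y[keep]), x_test)
    return scores

def component_attribution(w, x):
    """CA by perturbation: zero-ablate one weight (the 'components' of this toy model)."""
    full = predict(w, x)
    scores = np.zeros(len(w))
    for k in range(len(w)):
        w_abl = w.copy()
        w_abl[k] = 0.0
        scores[k] = full - predict(w_abl, x)
    return scores

# Synthetic data, purely illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=20)
x_test = np.array([1.0, 2.0, -1.0])

w = fit(X, y)
fa = feature_attribution(w, x_test)    # one score per input feature
da = data_attribution(X, y, x_test)    # one score per training point
ca = component_attribution(w, x_test)  # one score per weight
```

All three functions follow the same pattern: remove one element, re-evaluate the model, and record the difference. Only the element being removed changes. In this linear toy model, zeroing feature `i` and zeroing weight `i` give the same score `w[i] * x[i]`; this is a property of the toy setup and does not hold for deep networks.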

Paper Structure

This paper contains 79 sections, 5 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: The three types of attribution: FA, DA, and CA. While each type seeks to attribute a model's output to a different aspect (input features, training data, and model components) and provides complementary insight into model behavior, they all use shared techniques (perturbations, gradients, and linear approximations).