Decomposing and Editing Predictions by Modeling Model Computation
Harshay Shah, Andrew Ilyas, Aleksander Madry
TL;DR
This work proposes component modeling, the task of decomposing a model's predictions into contributions from its internal components, and introduces component attribution, which approximates the effect of ablating components on a prediction with a linear, interpretable surrogate. The COAR algorithm estimates per-example attributions by regressing observed counterfactual outputs onto random component ablations, which lets the analysis scale to large vision and language models. Empirically, COAR attributions accurately predict how outputs change under component ablations, and this holds across datasets, architectures, and modalities. Moreover, attribution-driven editing (COAR-Edit) can target specific predictions, classes, subpopulations, backdoors, and typographic attacks with minimal impact on overall performance, which points to practical uses in robust and safe model deployment.
Abstract
How does the internal computation of a machine learning model transform inputs into predictions? In this paper, we introduce a task called component modeling that aims to address this question. The goal of component modeling is to decompose an ML model's prediction in terms of its components: simple functions (e.g., convolution filters, attention heads) that are the "building blocks" of model computation. We focus on a special case of this task, component attribution, where the goal is to estimate the counterfactual impact of individual components on a given prediction. We then present COAR, a scalable algorithm for estimating component attributions; we demonstrate its effectiveness across models, datasets, and modalities. Finally, we show that component attributions estimated with COAR directly enable model editing across five tasks, namely: fixing model errors, "forgetting" specific classes, boosting subpopulation robustness, localizing backdoor attacks, and improving robustness to typographic attacks. We provide code for COAR at https://github.com/MadryLab/modelcomponents.
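To make the idea of component attribution concrete, the following is a minimal sketch of the regression-on-random-ablations recipe, not the implementation from the repository above. It assumes a toy "model" with five components whose output for one example can be queried under any ablation mask. The names `toy_model_output`, `estimate_attributions`, `ablate_frac`, and `n_samples` are hypothetical and chosen only for illustration, as is the convention that a mask entry of 0 means "ablated".

```python
import numpy as np

# Hypothetical toy model: the output for one fixed example is a nearly
# linear function of which components are active (1) or ablated (0).
TRUE_EFFECTS = np.array([2.0, -1.0, 0.0, 0.5, 0.0])

def toy_model_output(mask: np.ndarray) -> float:
    # A small interaction term makes the model not exactly linear,
    # so the fitted attribution is only a surrogate.
    return float(mask @ TRUE_EFFECTS) + 0.1 * mask[0] * mask[3]

def estimate_attributions(model_output, n_components, n_samples=2000,
                          ablate_frac=0.1, seed=0):
    """Fit a linear surrogate of the output as a function of ablation masks."""
    rng = np.random.default_rng(seed)
    # masks[i, j] == 0.0 means component j is ablated in sample i.
    masks = (rng.random((n_samples, n_components)) >= ablate_frac).astype(float)
    # Observe the counterfactual output under each random ablation.
    outputs = np.array([model_output(m) for m in masks])
    # Least-squares fit: outputs ~ masks @ weights + bias.
    design = np.hstack([masks, np.ones((n_samples, 1))])
    coef, *_ = np.linalg.lstsq(design, outputs, rcond=None)
    return coef[:-1], coef[-1]

weights, bias = estimate_attributions(toy_model_output, n_components=5)
print(np.round(weights, 2))
```

In this sketch, each fitted weight estimates how much the output changes when the corresponding component is kept rather than ablated, so components with near-zero weight have little counterfactual effect on this example. For a real network, `toy_model_output` would be replaced by a forward pass with the chosen components ablated, and the fit would be repeated per example.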
