Decomposing and Editing Predictions by Modeling Model Computation

Harshay Shah; Andrew Ilyas; Aleksander Madry

TL;DR

This work proposes component modeling, which decomposes a model's predictions into contributions from its internal components, and introduces component attribution as a linear, interpretable surrogate. The COAR algorithm estimates per-example attributions by regressing the model outputs observed under random component ablations onto the ablation patterns, enabling scalable analysis across large vision and language models. Empirically, COAR attributions accurately predict how outputs change under component ablations and generalize across datasets, architectures, and modalities. Moreover, attribution-driven editing (COAR-Edit) can target specific predictions, classes, subpopulations, backdoors, and typographic attacks with minimal impact on overall performance, highlighting practical utility for robust and safe model deployment.

Abstract

How does the internal computation of a machine learning model transform inputs into predictions? In this paper, we introduce a task called component modeling that aims to address this question. The goal of component modeling is to decompose an ML model's prediction in terms of its components: simple functions (e.g., convolution filters, attention heads) that are the "building blocks" of model computation. We focus on a special case of this task, component attribution, where the goal is to estimate the counterfactual impact of individual components on a given prediction. We then present COAR, a scalable algorithm for estimating component attributions; we demonstrate its effectiveness across models, datasets, and modalities. Finally, we show that component attributions estimated with COAR directly enable model editing across five tasks, namely: fixing model errors, "forgetting" specific classes, boosting subpopulation robustness, localizing backdoor attacks, and improving robustness to typographic attacks. We provide code for COAR at https://github.com/MadryLab/modelcomponents.
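The regression idea behind component attribution can be illustrated with a minimal sketch on a toy problem. It is not the released COAR implementation. The stand-in model `toy_model_output`, the independent (Bernoulli) ablation masks, the ablation fraction of 0.1, and the use of ordinary least squares are all assumptions chosen for illustration.

```python
import numpy as np

def toy_model_output(mask, weights):
    """Hypothetical stand-in for the model output on one example when the
    components with mask == 0 are ablated (not a real network)."""
    kept = mask * weights
    # Mostly additive, plus a mild nonlinearity so a linear fit is inexact.
    return kept.sum() + 0.1 * np.tanh(kept[:5].sum())

def sample_ablation_masks(n_samples, n_components, alpha, rng):
    """Assumption: each component is ablated independently with prob. alpha."""
    return (rng.random((n_samples, n_components)) >= alpha).astype(float)

def estimate_attributions(masks, outputs):
    """Fit outputs ~ masks @ theta + bias with ordinary least squares."""
    design = np.hstack([masks, np.ones((masks.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(design, outputs, rcond=None)
    return coef[:-1], coef[-1]

rng = np.random.default_rng(0)
n_components = 50
weights = rng.normal(size=n_components)  # hidden "true" component effects

# Build the regression dataset: (ablation mask, observed output) pairs.
train_masks = sample_ablation_masks(2000, n_components, alpha=0.1, rng=rng)
train_outputs = np.array([toy_model_output(m, weights) for m in train_masks])
theta, bias = estimate_attributions(train_masks, train_outputs)

# Check that the linear surrogate predicts outputs under unseen ablations.
test_masks = sample_ablation_masks(200, n_components, alpha=0.1, rng=rng)
test_outputs = np.array([toy_model_output(m, weights) for m in test_masks])
mae = np.abs(test_masks @ theta + bias - test_outputs).mean()
print(f"mean absolute error on held-out ablations: {mae:.4f}")
```

In this sketch, `theta[i]` plays the role of the attribution of component `i` for a single example: the estimated change in output from keeping that component rather than ablating it. A real setting would fit one such vector per example.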

Paper Structure (88 sections, 7 equations, 27 figures, 1 algorithm)

Introduction
Roadmap & contributions.
Setup and Problem Statement
Component modeling.
Component attribution.
Component attribution with COAR
Approach.
Instantiating COAR for classifiers.
Does COAR learn accurate component attributions?
Datasets, models, and components.
Applying COAR.
Evaluation metric.
Baselines.
Results
Example-level analysis.
...and 73 more sections

Figures (27)

Figure 1: A summary of the component modeling framework.
Figure 2: Evaluating COAR attributions. We evaluate whether component attributions computed using our procedure COAR accurately predict component counterfactuals (Equation \ref{eq:ablation}). We compare COAR to four baselines (described in Section \ref{sec:eval}) on three image classification setups (one per row). The subfigures on the left each focus on a single example $z$ (visualized in the bottom-right corner of each plot), and show that for each setup, the ground-truth component counterfactuals $f_M(z, \cdot)$ ($x$-axis) and attribution-based estimates $g^{(z)}(\cdot)$ ($y$-axis) exhibit high correlation $\rho(z)$. On the right, we observe that COAR attributions exhibit high average correlation $\mathbb{E}_z[\rho(z)]$ over test examples, outperforming all baselines in each task and for all ablation fractions $\alpha_{\text{test}}$. The asterisk ($*$) in each legend denotes $\alpha_{\text{train}}$, the ablation fraction used to fit the component attributions.
Figure 3: Editing individual model predictions with COAR-Edit. We edit a ResNet-50 model to correct a misclassified ImageNet example $z$ shown on the left. Specifically, ablating a few components via COAR-Edit (see Equation \ref{eq:c-edit}) increases the correct-class margin (Equation \ref{eq:margin}) on example $z$ (red) without changing the average margin on the train set (light blue) or validation set (dark blue). In the center panel, we observe that the examples on which model outputs change the least (top row) due to the edit are visually dissimilar to example $z$ as well as examples on which model outputs change most positively (middle row) and negatively (bottom row). On the right, we find that individually performing model edits to correct every misclassified example in the validation set incurs a median accuracy drop of at most $0.2\%$ on the train set (top row) and validation set (bottom row).
Figure 4: "Forgetting" a class with COAR-Edit. We edit an ImageNet-trained ResNet-50 (Setup B from Section \ref{sec:eval}) to selectively degrade performance on the "chain-link fence" class. On the left, we observe that increasing the number of components $k$ ablated via COAR-Edit decreases model accuracy on the "chain-link fence" class (red) while preserving overall accuracy on the train and validation set. In the center panel, we compare class-wise accuracies before and after performing the model edit and observe a significant accuracy drop on the "chain-link fence" class but not on other classes. On the right, we find that the edit transfers to distribution-shifted versions of ImageNet (ImageNet-Sketch [wang2019learning] and ImageNet$\star$ [vendrow2023dataset]) as targeted, i.e., degrading performance on the "chain-link fence" class without changing average performance.
Figure 5: Improving subpopulation robustness with COAR-Edit. We edit pre-trained ResNet-50 models to improve their worst-subpopulation accuracy on two benchmark datasets: Waterbirds [sagawa2020distributionally] and CelebA [liu2015deep]. Before applying COAR-Edit, models fine-tuned on Waterbirds and CelebA attain $87\%$ and $96\%$ test accuracy but only $64\%$ and $47\%$ accuracy on their worst-performing subpopulations, respectively. On the left, applying COAR-Edit by ablating $210$ of $22{,}720$ components in the Waterbirds model increases worst-subpopulation accuracy from $64\%$ to $83\%$ without degrading its accuracy averaged over examples (light blue) and subpopulations (dark blue). Similarly, on the right, editing the CelebA model by ablating a targeted subset of $26$ components improves worst-subpopulation accuracy from $47\%$ to $85\%$.
...and 22 more figures
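
The evaluation summarized in the Figure 2 caption compares ground-truth counterfactuals $f_M(z, \cdot)$ with attribution-based estimates $g^{(z)}(\cdot)$ through a per-example correlation $\rho(z)$, then averages over examples. The sketch below shows one way such a metric could be computed. The choice of Pearson correlation, the array layout (one row per example, one column per ablation subset), and the synthetic data are assumptions made for illustration only.

```python
import numpy as np

def per_example_correlation(ground_truth, estimates):
    """Pearson correlation per example (row) across ablation subsets (columns).

    Both inputs have shape (n_examples, n_subsets). Using Pearson correlation
    is an assumption; the caption only says "correlation".
    """
    gt = ground_truth - ground_truth.mean(axis=1, keepdims=True)
    est = estimates - estimates.mean(axis=1, keepdims=True)
    numerator = (gt * est).sum(axis=1)
    denominator = np.sqrt((gt ** 2).sum(axis=1) * (est ** 2).sum(axis=1))
    return numerator / denominator

# Synthetic stand-ins: 8 examples, 100 random ablation subsets each.
rng = np.random.default_rng(0)
ground_truth = rng.normal(size=(8, 100))                     # plays f_M(z, .)
estimates = ground_truth + 0.3 * rng.normal(size=(8, 100))   # plays g^(z)(.)

rho = per_example_correlation(ground_truth, estimates)  # rho(z) per example
mean_rho = rho.mean()                                   # estimate of E_z[rho(z)]
print(f"average correlation over examples: {mean_rho:.3f}")
```

Computing the correlation per example before averaging keeps examples with large output ranges from dominating the summary.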

Theorems & Definitions (7)

Definition 1: Component modeling
Definition 2: Component attribution
Remark 1: Linearity and misspecification
Remark 2: Ablation is not removal
Remark 3: Relation between baselines and patching
Definition 3: Editing models by ablating components
Remark 4: Ablation-based edits
