Who Gets Credit or Blame? Attributing Accountability in Modern AI Systems

Shichang Zhang, Hongzhe Du, Jiaqi W. Ma, Himabindu Lakkaraju

TL;DR

This work defines accountability attribution for modern AI systems by treating training stages as causal interventions and applying a potential outcomes framework to pose counterfactual questions about stage effects on final model behavior. It introduces the AA-Score, a first-order, retraining-free estimator that accounts for optimization dynamics, learning rate schedules, momentum, and weight decay to attribute performance changes to specific stages. The approach yields stage embeddings that enable efficient, model-specific attribution across inputs and metrics, and demonstrates the ability to identify both beneficial and harmful stages, including the emergence of spurious correlations and the impact of data distribution shifts. Through experiments on MNIST, CelebA, CivilComments, and chest X-ray datasets, the paper shows high correlation with retraining-based counterfactuals and reveals practical implications for debugging, auditing, and responsible AI deployment.

Abstract

Modern AI systems are typically developed through multiple stages (pretraining, fine-tuning rounds, and subsequent adaptation or alignment), where each stage builds on the previous ones and updates the model in distinct ways. This raises a critical question of accountability: when a deployed model succeeds or fails, which stage is responsible, and to what extent? We pose the accountability attribution problem for tracing model behavior back to specific stages of the model development process. To address this challenge, we propose a general framework that answers counterfactual questions about stage effects: how would the model's behavior have changed if the updates from a particular stage had not occurred? Within this framework, we introduce estimators that efficiently quantify stage effects without retraining the model, accounting for both the data and key aspects of model optimization dynamics, including learning rate schedules, momentum, and weight decay. We demonstrate that our approach successfully quantifies the accountability of each stage to the model's behavior. Based on the attribution results, our method can identify and remove spurious correlations learned during image classification and text toxicity detection tasks that were developed across multiple stages. Our approach provides a practical tool for model analysis and represents a significant step toward more accountable AI development.
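To make the counterfactual question concrete, the following toy sketch compares retraining without a stage against a retraining-free, first-order estimate of the resulting parameter shift. It is not the paper's AA-Score implementation: the least-squares data, the function names (`make_batches`, `train`, `first_order_stage_shift`), and the choice of stage boundaries are all assumptions made for illustration, and momentum is omitted for brevity. The estimate propagates each skipped update through the linearized later steps, using only quantities available along the observed training run.

```python
import numpy as np


def make_batches(rng, n_steps, batch_size, dim):
    """Hypothetical toy data: one least-squares mini-batch per update step."""
    w_true = rng.normal(size=dim)
    batches = []
    for _ in range(n_steps):
        X = rng.normal(size=(batch_size, dim))
        y = X @ w_true + 0.1 * rng.normal(size=batch_size)
        batches.append((X, y))
    return batches


def train(batches, lrs, wd, skip=()):
    """SGD with weight decay; steps whose index is in `skip` apply no update."""
    theta = np.zeros(batches[0][0].shape[1])
    for k, (X, y) in enumerate(batches):
        if k in skip:
            continue  # counterfactual: this step's update never happened
        grad = X.T @ (X @ theta - y) / len(y)
        theta = theta - lrs[k] * (grad + wd * theta)
    return theta


def first_order_stage_shift(batches, lrs, wd, stage):
    """Estimate theta_K - theta_K(stage skipped) without retraining.

    Walks the observed trajectory once. Updates made inside `stage` are
    added to the running shift; later steps propagate the shift through
    the linearized update I - lr * (Hessian + wd * I).
    """
    dim = batches[0][0].shape[1]
    theta = np.zeros(dim)  # observed parameters
    delta = np.zeros(dim)  # estimated observed-minus-counterfactual shift
    for k, (X, y) in enumerate(batches):
        n = len(y)
        grad = X.T @ (X @ theta - y) / n
        update = -lrs[k] * (grad + wd * theta)
        if k in stage:
            delta = delta + update
        else:
            hessian = X.T @ X / n  # Hessian of this batch's loss
            delta = delta - lrs[k] * (hessian @ delta + wd * delta)
        theta = theta + update
    return delta


rng = np.random.default_rng(0)
n_steps = 30
batches = make_batches(rng, n_steps=n_steps, batch_size=8, dim=3)
lrs = np.linspace(0.05, 0.01, n_steps)  # decaying learning rate schedule
stage_2 = range(10, 20)                 # assumed stage boundaries

theta_obs = train(batches, lrs, wd=0.01)
theta_cf = train(batches, lrs, wd=0.01, skip=stage_2)  # retraining baseline
shift_est = first_order_stage_shift(batches, lrs, wd=0.01, stage=stage_2)

# Largest gap between the estimate and the retraining-based shift.
print(np.max(np.abs(shift_est - (theta_obs - theta_cf))))
```

Because the toy loss is quadratic, the linearization is exact here and the estimate should match retraining up to floating-point error. For non-quadratic losses the same recipe is only a first-order approximation, which is the regime where an error bound on the estimate becomes relevant.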

Paper Structure

This paper contains 42 sections, 3 theorems, 42 equations, 5 figures, 3 tables.

Key Result

Theorem 4.1

Assume the loss function $\mathcal{L}$ is twice differentiable with an $L$-Lipschitz continuous Hessian, and the propagator satisfies the spectral bound $\| M_{\color{timecolor}k} \|_2 \le e^{\eta_{\color{timecolor}k} \Lambda}$ for all ${\color{timecolor}k}$. Let $D_{{\color{treatmentcolor}S}} := \s

Figures (5)

  • Figure 1: Illustration of the accountability attribution problem of a generative AI model developed in three stages (e.g., pretraining and two fine-tuning rounds).
  • Figure 2: Illustration of the accountability attribution problem for a single stage. A model is developed in $N$ stages, each comprising a sequence of model parameter updates. The goal is to estimate the causal effect of a stage, e.g., Stage 2, on the final model behavior. The diagram shows the actual model development process (top) and a counterfactual version (bottom) where Stage 2 had not occurred. The accountability of Stage 2 for predicting an input $x$ is quantified by the performance difference between the observed ($\theta_{{\color{timecolor}K}}$) and counterfactual ($\theta_{{\color{timecolor}K}}({\color{treatmentcolor}0}_{{\color{treatmentcolor}2}})$) model.
  • Figure 3: Performance effect on MNIST. Each bar shows the AA-Score estimate ($\hat{\tau}_{{\color{treatmentcolor}t}, {\color{timecolor}K}}$) for an update step $t$, which can be aggregated to stage effects. A positive $\hat{\tau}_{{\color{treatmentcolor}t}, {\color{timecolor}K}}$ indicates that the stage leads to a higher log-likelihood, i.e., the stage is beneficial. (a) Accurately detects the influential stage of an inserted data point. (b) Captures a stage processing mislabeled data, demonstrating their negative effect on the test performance. (c-e) The stage with the highest effect on the test set is the in-distribution (ID) training stage. (c) Original test set. (d) 45-degree rotated test set. (e) 90-degree rotated test set. (f) Baseline for optimization parameters. (g) Higher/lower lr leads to higher/lower performance effect. (h) Higher/lower mom leads to higher/lower performance effect. (i) Higher/lower wd leads to slightly lower/higher performance effect. (j) Higher/lower wd leads to slightly lower/higher performance effect.
  • Figure 4: Performance effect estimation (left) and retraining likelihood with confounding features as the label (right) on CivilComments. There are 8 positive and 2 negative stages. We show the likelihood decreases when skipping the top 3 positive stages and increases when skipping the 2 negative stages.
  • Figure 5: The effect of inserting a test digit '4' during training on the model's ability to classify four different digits. (a) is the same case as Fig 1 (a), showing the effect on the same digit '4' that is inserted. (b) is the effect on another digit '4' from the test set. (c) is the effect on digit '9', which is easily confused with '4'. (d) is the effect on a neutral digit '2', which is visually distinct from '4'.

Theorems & Definitions (6)

  • Definition 3.1
  • Definition 3.2
  • Theorem 4.1: Error Bound for Stage Effect Estimation
  • Theorem B.1: Error Bound for Stage Effect Estimation
  • Theorem B.2: Error Bound for the Effect of a Single Step
  • proof