Stochastic Amortization: A Unified Approach to Accelerate Feature and Data Attribution

Ian Covert; Chanwoo Kim; Su-In Lee; James Zou; Tatsunori Hashimoto

TL;DR

This work explores training amortized models with noisy labels and shows that this approach tolerates high noise levels and significantly accelerates several feature attribution and data valuation methods, often yielding an order of magnitude speedup over existing approaches.

Abstract

Many tasks in explainable machine learning, such as data valuation and feature attribution, perform expensive computation for each data point and are intractable for large datasets. These methods require efficient approximations, and although amortizing the process by learning a network to directly predict the desired output is a promising solution, training such models with exact labels is often infeasible. We therefore explore training amortized models with noisy labels, and we find that this is inexpensive and surprisingly effective. Through theoretical analysis of the label noise and experiments with various models and datasets, we show that this approach tolerates high noise levels and significantly accelerates several feature attribution and data valuation methods, often yielding an order of magnitude speedup over existing approaches.
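The central claim, that a model fit to noisy but unbiased labels can be far more accurate than the labels themselves, can be illustrated with a toy sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the true outputs are linear in the input, the noisy oracle adds zero-mean Gaussian noise, the amortized model is a linear regressor trained with plain SGD, and the names `run_demo`, `noise_std`, and `lr` are hypothetical.

```python
import random


def run_demo(n=2000, d=3, noise_std=2.0, lr=0.01, epochs=5, seed=0):
    """Fit a linear 'amortized model' to noisy labels and compare errors.

    Returns (noisy_mse, amortized_mse), both measured against the exact outputs.
    """
    rng = random.Random(seed)

    # Assumed ground truth: exact outputs a(b) are linear in the input b.
    theta_true = [rng.gauss(0, 1) for _ in range(d)]
    inputs = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    exact = [sum(t * x for t, x in zip(theta_true, b)) for b in inputs]

    # Noisy oracle: unbiased but high-variance estimate of each exact output.
    noisy = [a + rng.gauss(0, noise_std) for a in exact]

    # Train the amortized model with SGD on squared error to the noisy labels.
    theta = [0.0] * d
    for _ in range(epochs):
        order = list(range(n))
        rng.shuffle(order)
        for i in order:
            b = inputs[i]
            residual = sum(t * x for t, x in zip(theta, b)) - noisy[i]
            theta = [t - lr * residual * x for t, x in zip(theta, b)]

    # Compare both the raw noisy labels and the model predictions to the truth.
    preds = [sum(t * x for t, x in zip(theta, b)) for b in inputs]
    noisy_mse = sum((y - a) ** 2 for y, a in zip(noisy, exact)) / n
    amortized_mse = sum((p - a) ** 2 for p, a in zip(preds, exact)) / n
    return noisy_mse, amortized_mse


if __name__ == "__main__":
    noisy_mse, amortized_mse = run_demo()
    print(f"noisy label MSE:     {noisy_mse:.4f}")
    print(f"amortized model MSE: {amortized_mse:.4f}")
```

Because the label noise is zero-mean and independent across data points, it largely averages out during training, so the fitted model's error against the exact outputs should fall well below that of the individual noisy labels. The paper's experiments use neural networks and estimators such as KernelSHAP and Monte Carlo sampling in place of the toy pieces above.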

Paper Structure (26 sections, 6 theorems, 72 equations, 22 figures, 1 table)

This paper contains 26 sections, 6 theorems, 72 equations, 22 figures, and 1 table.

Introduction
Background
Related work
Stochastic Amortization
Applications to Explainable ML
Shapley value feature attribution
Alternative feature attributions
Data valuation
Experiments
Feature attribution
Data valuation
Distributional data valuation
Conclusion
Extended Related Work
Datamodels
...and 11 more sections

Key Result

Theorem 1

Consider a noisy oracle $\tilde{a}(b)$ that satisfies $\mathbb{E}[\tilde{a}(b) \mid b] = \tilde{\theta} b$ with parameters $\tilde{\theta} \in \mathbb{R}^{m \times d}$ such that $\lVert\tilde{\theta}\rVert_F \leq D$. Given a distribution $p(b)$, define the norm-weighted distribution $q(b) \propto p($ […] where $\mathrm{N}_q(\tilde{a}) \equiv \mathbb{E}_q[\mathrm{N}(\tilde{a} \mid b)]$ is the noisy orac […]

Figures (22)

Figure 1: Diagram of stochastic amortization. Left: using a dataset with noisy labels $\tilde{a}(b)$ (e.g., images and data valuation estimates), we can train an amortized model that accurately estimates the true outputs $a(b)$ (e.g., data valuation scores). Right: the default approach of running an expensive approximation algorithm for each example (e.g., a Monte Carlo estimator with many samples; Ghorbani and Zou, 2019).
Figure 2: Stochastic amortization for Shapley value feature attributions. We compare the predicted attributions to the noisy labels and ground truth, which are generated using KernelSHAP with $512$ and 1M samples, respectively.
Figure 3: Amortized Shapley value feature attributions using KernelSHAP as a noisy oracle. Left: squared error relative to the ground truth attributions when using noisy labels with different numbers of samples (different noise levels). Center: estimation error as a function of FLOPs, where KernelSHAP incurs FLOPs via classifier predictions used to estimate the attributions, and amortization incurs additional FLOPs from training (training appears as a vertical line because the FLOPs are relatively low, and endpoints represent results from the final epoch). Right: estimation error with different training dataset sizes given equivalent compute per data point (matched by using fewer KernelSHAP samples when generating noisy labels for amortization and allowing up to $50$ epochs of training).
Figure 4: Amortized data valuation accuracy for tabular datasets. Left: mean squared error relative to the ground truth for the MiniBooNE dataset, normalized so that the mean valuation score has error equal to $1$ (for 1K and 10K data points). The x-axis indicates how many Monte Carlo samples were used for each data point. Center: Pearson correlation with the ground truth for the MiniBooNE dataset (for 1K and 10K data points). Right: estimation accuracy for the MiniBooNE and adult census datasets as a function of dataset size (250 to 10K data points); we use $50$ Monte Carlo samples per data point for all results and show the Pearson correlation with the ground truth.
Figure 5: Distributional data valuation for CIFAR-10. Left: estimation error when using different numbers of samples for the noisy label estimates. Center: Pearson correlation with the ground truth for different numbers of noisy samples. Right: estimation error as a function of dataset size, where all results use $5$ Monte Carlo samples per data point; we compare the error for amortized estimates on internal (training) and external (unseen) data points, demonstrating strong generalization.
...and 17 more figures

Theorems & Definitions (12)

Theorem 1
Proposition 1
proof
Theorem 1
proof
Corollary 1
proof
proof
Lemma 1
proof
...and 2 more
