Accumulative SGD Influence Estimation for Data Attribution

Yunxiao Shi, Shuo Yang, Yixin Su, Rui Zhang, Min Xu

TL;DR

ACC-SGD-IE addresses the cross-epoch bias of SGD-IE in data-influence estimation by propagating leave-one-out perturbations with per-occurrence Hessian corrections across training. It derives a recursive, curvature-aware formulation and unrolls to a closed-form accumulation that prevents drift over long multi-epoch runs. The method yields geometric error contraction in smooth strongly convex settings and tighter non-convex bounds, with empirical gains in estimation fidelity and downstream data cleansing across diverse datasets and noise conditions. While incurring higher time and memory costs, ACC-SGD-IE establishes a new, transferable paradigm for accurate influence estimation in data-centric AI and points to scalable extensions and domain-specific optimizations.

Abstract

Modern data-centric AI needs precise per-sample influence. Standard SGD-IE approximates leave-one-out effects by summing per-epoch surrogates and ignores cross-epoch compounding, which misranks critical examples. We propose ACC-SGD-IE, a trajectory-aware estimator that propagates the leave-one-out perturbation across training and updates an accumulative influence state at each step. In smooth strongly convex settings it achieves geometric error contraction and, in smooth non-convex regimes, it tightens error bounds; larger mini-batches further reduce constants. Empirically, on Adult, 20 Newsgroups, and MNIST under clean and corrupted data and both convex and non-convex training, ACC-SGD-IE yields more accurate influence estimates, especially over long multi-epoch runs. For downstream data cleansing it more reliably flags noisy samples, and models trained on ACC-SGD-IE-cleaned data outperform those trained on SGD-IE-cleaned data.
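
The idea of carrying a leave-one-out perturbation along the whole training trajectory can be illustrated on a toy problem. The sketch below is not the paper's ACC-SGD-IE estimator. It is a simplified recursion under assumptions that do not come from the source: squared loss (so the first-order propagation happens to be exact), a fixed learning rate, and a counterfactual run that drops example `k` from each batch it appears in while keeping the 1/|S_t| scaling. All function and variable names are hypothetical.

```python
import numpy as np

def run_sgd(X, y, batches, lr, skip=None):
    """Mini-batch SGD on squared loss. If `skip` is given, that example is
    dropped from every batch it occurs in (counterfactual run)."""
    theta = np.zeros(X.shape[1])
    for S in batches:
        S = np.asarray(S)
        idx = S if skip is None else S[S != skip]
        # Gradient sum over the kept examples, scaled by the full batch size.
        grad = X[idx].T @ (X[idx] @ theta - y[idx]) / len(S)
        theta = theta - lr * grad
    return theta

def propagate_loo(X, y, batches, lr, k):
    """Track delta_t = theta_t(without k) - theta_t(with k) along the
    original trajectory, updating an accumulated state at every step."""
    theta = np.zeros(X.shape[1])
    delta = np.zeros(X.shape[1])
    for S in batches:
        S = np.asarray(S)
        rest = S[S != k]
        # Curvature of the batch without example k (exact for squared loss).
        H_rest = X[rest].T @ X[rest] / len(S)
        new_delta = delta - lr * (H_rest @ delta)
        if k in S:
            # Each occurrence of k injects its gradient at the current iterate.
            g_k = X[k] * (X[k] @ theta - y[k])
            new_delta = new_delta + (lr / len(S)) * g_k
        grad = X[S].T @ (X[S] @ theta - y[S]) / len(S)
        theta = theta - lr * grad
        delta = new_delta
    return delta

# Toy data and a multi-epoch batch schedule (all values are illustrative).
rng = np.random.default_rng(0)
n, d, batch_size, epochs, lr, k = 32, 3, 4, 5, 0.05, 7
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
batches = []
for _ in range(epochs):
    perm = rng.permutation(n)
    batches += [perm[i:i + batch_size] for i in range(0, n, batch_size)]

actual = run_sgd(X, y, batches, lr, skip=k) - run_sgd(X, y, batches, lr)
estimate = propagate_loo(X, y, batches, lr, k)
print("max abs difference:", np.max(np.abs(actual - estimate)))
```

Because the perturbation from an early occurrence of `k` keeps being transformed by later steps, an estimator that resets this state at epoch boundaries would miss the compounding that the abstract describes. For non-quadratic losses the recursion is only a first-order approximation.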

Paper Structure

This paper contains 50 sections, 10 theorems, 50 equations, 7 figures, 5 tables.

Key Result

Theorem 4.1

Under asm:assumption_convex, for every training example $z_{k}$ and all $N\!>\!\pi_{1}(k)$

Figures (7)

  • Figure 1: Illustration of estimation bias across epochs in the classical SGD-Influence Estimator.
  • Figure 2: Under the non-convex regime: loss-change estimation results on clean, feature-noisy, and label-noisy data. Points closer to the black diagonal represent more accurate estimates of the real loss change. Parts 1–3 correspond respectively to the entries in Table \ref{tab:all_influence}.
  • Figure 3: Under the convex regime: loss-change estimation results on clean, feature-noisy, and label-noisy data.
  • Figure 4: Cross-epoch estimation performance: SGD-IE vs. ACC-SGD-IE.
  • Figure 5: Data cleansing results, evaluated by average misclassification rate on the test dataset.
  • ...and 2 more figures

Theorems & Definitions (12)

  • Definition 1: Counterfactual SGD
  • Definition 2: SGD-Influence
  • Theorem 4.1
  • Theorem 4.2
  • Proposition B.1: Estimation Error of SGD-IE
  • Proposition B.2: Estimation Error of ACC-SGD-IE
  • Lemma 1: Linear contraction of $U_i$
  • Theorem C.1: SGD-IE: Polynomial Decay
  • Theorem C.2: ACC-SGD-IE: Geometric Decay
  • Theorem D.1: SGD-IE: Error Bound for Non-Convex
  • ...and 2 more