Natural Geometry of Robust Data Attribution: From Convex Models to Deep Networks
Shihao Li, Jiachen Li, Dongmei Chen
TL;DR
This work addresses the fragility of data attribution under distributional shifts by proposing certified robust attribution via Wasserstein distributionally robust optimization (DRO). It identifies a spectral amplification barrier that makes Euclidean certification vacuous for deep networks, and introduces the Natural Wasserstein metric, grounded in the model's feature covariance, to obtain non-vacuous certificates. The main theoretical result is that the Self-Influence term equals the attribution Lipschitz constant under the Natural metric, linking leverage-based anomaly detection to robustness. Empirically, Natural W-TRAK certifies 68.7% of ranking pairs on CIFAR-10 with ResNet-18 and Self-Influence identifies mislabeled data effectively (0.970 AUROC for label-noise detection), providing both practical data valuation and a principled robustness guarantee for attribution in deep networks.
Abstract
Data attribution methods identify which training examples are responsible for a model's predictions, but their sensitivity to distributional perturbations undermines practical reliability. We present a unified framework for certified robust attribution that extends from convex models to deep networks. For convex settings, we derive Wasserstein-Robust Influence Functions (W-RIF) with provable coverage guarantees. For deep networks, we demonstrate that Euclidean certification is rendered vacuous by spectral amplification, a mechanism in which the inherent ill-conditioning of deep representations inflates Lipschitz bounds by over $10{,}000\times$. This explains why standard TRAK scores, while accurate point estimates, are geometrically fragile: naive Euclidean robustness analysis yields 0\% certification. Our key contribution is the Natural Wasserstein metric, which measures perturbations in the geometry induced by the model's own feature covariance. This eliminates spectral amplification, reducing worst-case sensitivity by $76\times$ and stabilizing attribution estimates. On CIFAR-10 with ResNet-18, Natural W-TRAK certifies 68.7\% of ranking pairs compared to 0\% for Euclidean baselines; to our knowledge, these are the first non-vacuous certified bounds for neural network attribution. Furthermore, we prove that the Self-Influence term arising from our analysis equals the Lipschitz constant governing attribution stability, providing theoretical grounding for leverage-based anomaly detection. Empirically, Self-Influence achieves 0.970 AUROC for label noise detection, identifying 94.1\% of corrupted labels by examining just the top 20\% of training data.
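To make the leverage-based reading of Self-Influence concrete, the following is a minimal sketch rather than the paper's implementation. It assumes Self-Influence takes the standard leverage-score form $\phi_i^\top \Sigma^{-1} \phi_i$, where $\Sigma$ is the ridge-regularized feature covariance; the exact definition, the feature map, and the names `self_influence`, `flag_top_fraction`, and `ridge` are illustrative assumptions not specified in this section. The synthetic features are deliberately ill-conditioned, and one point is planted along a low-variance direction to stand in for an anomalous example.

```python
import numpy as np


def self_influence(features, ridge=1e-6):
    """Leverage-style score phi_i^T (Sigma + ridge*I)^{-1} phi_i per row.

    Assumed form (hypothetical): Sigma is the empirical feature covariance.
    """
    n, d = features.shape
    cov = features.T @ features / n + ridge * np.eye(d)
    # Solve a linear system instead of forming the explicit inverse.
    solved = np.linalg.solve(cov, features.T)  # shape (d, n)
    return np.einsum("nd,dn->n", features, solved), cov


def flag_top_fraction(scores, fraction=0.2):
    """Indices of the highest-scoring points, highest first."""
    k = int(np.ceil(fraction * len(scores)))
    return np.argsort(scores)[::-1][:k]


# Synthetic, ill-conditioned features: per-dimension scales span four
# orders of magnitude (an illustrative stand-in for deep representations).
rng = np.random.default_rng(0)
scales = np.array([10.0, 1.0, 0.1, 0.01, 0.001])
features = rng.normal(size=(200, 5)) * scales

# Plant one anomalous point along the lowest-variance direction.
features[0, 4] = 1.0

scores, cov = self_influence(features)
flagged = flag_top_fraction(scores, fraction=0.2)

print("condition number of covariance:", np.linalg.cond(cov))
print("highest self-influence index:", int(np.argmax(scores)))
```

In this toy setting the planted point is small in Euclidean norm compared with typical points, yet it receives the largest score because the covariance-induced geometry rescales each direction by its variance. Ranking by score and inspecting only the top fraction mirrors the top-20% inspection protocol described in the abstract, though the numbers here say nothing about the reported CIFAR-10 results.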
