Training Data Influence Analysis and Estimation: A Survey

Zayd Hammoudeh, Daniel Lowd

TL;DR

Training data quality critically shapes model predictions, yet in deep models the causal relationship between individual training points and model outputs remains opaque. This survey organizes seven primary pointwise training data influence methods into retraining-based and gradient-based families, and compares their definitions, assumptions, and computational trade-offs. It also covers extensions, applications, and future directions, highlighting open challenges in scalability, robustness, and group-level influence. The work serves as a resource hub and a roadmap toward more reliable, empirically grounded data valuation.

Abstract

Good models require good training data. For overparameterized deep models, the causal relationship between training data and model predictions is increasingly opaque and poorly understood. Influence analysis partially demystifies training's underlying interactions by quantifying the amount each training instance alters the final model. Measuring the training data's influence exactly can be provably hard in the worst case; this has led to the development and use of influence estimators, which only approximate the true influence. This paper provides the first comprehensive survey of training data influence analysis and estimation. We begin by formalizing the various, and in places orthogonal, definitions of training data influence. We then organize state-of-the-art influence analysis methods into a taxonomy; we describe each of these methods in detail and compare their underlying assumptions, asymptotic complexities, and overall strengths and weaknesses. Finally, we propose future research directions to make influence analysis more useful in practice as well as more theoretically and empirically sound. A curated, up-to-date list of resources related to influence analysis is available at https://github.com/ZaydH/influence_analysis_papers.

Paper Structure (49 sections, 66 equations, 5 figures, 4 tables, 4 algorithms)


Figures (5)

  • Figure 1: Outlier Pointwise Influence on Least-Squares Regression: Influence of a single outlier on a least-squares model where the in-distribution data are generated from the linear distribution $y = 2x$. The single outlier sample ($x = 5$, $y = 1.2$) influences the inlier-only least-squares linear model substantially, such that a least-squares model trained on all instances predicts all training $y$ values poorly. Adapted from Rousseeuw:1997:RobustRegressionOutlierDetection.
  • Figure 2: Influence Analysis Taxonomy: Categorization of the seven primary pointwise influence analysis methods. The paper's section on retraining-based estimators details the three primary retraining-based influence methods: leave-one-out, Downsampling, and Shapley value. Its section on gradient-based estimators details the static estimators (influence functions and representer point) as well as the dynamic estimators (TracIn and HyDRA). Closely related and derivative estimators are shown as a list below their parent method. See the supplemental nomenclature tables for the formal mathematical definition of all influence methods and estimators; due to space, each method's citation is also given there.
  • Figure : Dynamic influence estimation's training phase
  • Figure : TracIn influence estimation
  • Figure : Fast HyDRA influence estimation for gradient descent without momentum
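
The setting in Figure 1's caption can be reproduced numerically. The following is a minimal sketch, not the authors' code: the inlier $x$ values, the test point, the helper name `fit_line`, and the choice to measure influence as the change in a single prediction are all assumptions made for illustration. Only the generating line $y = 2x$ and the outlier ($x = 5$, $y = 1.2$) come from the caption.

```python
import numpy as np


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept (hypothetical helper)."""
    design = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef  # array [slope, intercept]


# Assumed inliers lying exactly on y = 2x; the figure's actual points are not given.
x_in = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_in = 2.0 * x_in

# Full training set: the inliers plus the single outlier from the caption.
x_all = np.append(x_in, 5.0)
y_all = np.append(y_in, 1.2)

slope_in, icpt_in = fit_line(x_in, y_in)
slope_all, icpt_all = fit_line(x_all, y_all)

# Assumed influence measure: the change in the prediction at one test point
# when the outlier is added to the training set (a leave-one-out style comparison).
x_test = 4.0
pred_in = slope_in * x_test + icpt_in
pred_all = slope_all * x_test + icpt_all
influence = pred_all - pred_in

print(f"inlier-only fit: slope={slope_in:.3f}, intercept={icpt_in:.3f}")
print(f"all-data fit:    slope={slope_all:.3f}, intercept={icpt_all:.3f}")
print(f"change in prediction at x={x_test}: {influence:.3f}")
```

Under these assumptions, the inlier-only fit recovers the generating line, while adding the one outlier flattens the slope well below 2, which is the qualitative effect the caption describes.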

Theorems & Definitions (9)

  • Remark 1
  • Remark 2
  • Remark 3
  • Remark 4
  • Remark 5
  • Remark 6
  • Remark 7
  • Remark 8
  • Remark 9