Most Influential Subset Selection: Challenges, Promises, and Beyond

Yuzheng Hu; Pingbang Hu; Han Zhao; Jiaqi W. Ma

TL;DR

This paper analyzes the Most Influential Subset Selection (MISS) problem, showing that traditional influence-function–based greedy methods can fail even in linear models due to non-additivity and leverage effects. It introduces an adaptive greedy algorithm that iteratively refits and reevaluates sample influence to capture interactions among training samples, providing theoretical support and empirical evidence across regression, classification, and non-linear neural networks. Key phenomena uncovered include amplification and cancellation of group effects, as well as the non-submodularity of higher-order approximations, which complicates guarantees for MISS. The work also discusses how the target function choice critically shapes the observed influence and questions the use of additive metrics like Linear Datamodeling Score (LDS), highlighting the need for function-aware, data-contextual MISS methods with consideration of the trade-off between performance and efficiency.

Abstract

How can we attribute the behaviors of machine learning models to their training data? While the classic influence function sheds light on the impact of individual samples, it often fails to capture the more complex and pronounced collective influence of a set of samples. To tackle this challenge, we study the Most Influential Subset Selection (MISS) problem, which aims to identify a subset of training samples with the greatest collective influence. We conduct a comprehensive analysis of the prevailing approaches in MISS, elucidating their strengths and weaknesses. Our findings reveal that influence-based greedy heuristics, a dominant class of algorithms in MISS, can provably fail even in linear regression. We delineate the failure modes, including the errors of the influence function and the non-additive structure of the collective influence. Conversely, we demonstrate that an adaptive version of these heuristics, which applies them iteratively, can effectively capture the interactions among samples and thus partially address the issues. Experiments on real-world datasets corroborate these theoretical findings and further demonstrate that the merit of adaptivity can extend to more complex scenarios such as classification tasks and non-linear neural networks. We conclude our analysis by emphasizing the inherent trade-off between performance and computational efficiency, questioning the use of additive metrics such as the Linear Datamodeling Score, and offering a range of discussions.
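To make the contrast between the one-shot greedy heuristic and its adaptive version concrete, here is a minimal sketch for ordinary least squares. It is an illustration under stated assumptions, not the authors' implementation: the target function is assumed to be the prediction at a hypothetical test point `x_test`, "most influential" is assumed to mean the largest increase in that prediction, and the function names (`influence_scores`, `greedy_select`, `adaptive_greedy_select`, `actual_effect`) are invented for this example.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit; returns the coefficient vector."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def influence_scores(X, y, x_test):
    """First-order estimate of the change in x_test @ beta when each
    sample is removed individually (assumed target function)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    resid = y - X @ (XtX_inv @ X.T @ y)
    return -(X @ XtX_inv @ x_test) * resid

def greedy_select(X, y, x_test, k):
    """One-shot greedy: score every sample once, keep the top k."""
    scores = influence_scores(X, y, x_test)
    return [int(i) for i in np.argsort(-scores)[:k]]

def adaptive_greedy_select(X, y, x_test, k):
    """Adaptive greedy: pick one sample, drop it, rescore on the rest."""
    remaining = np.arange(len(y))
    chosen = []
    for _ in range(k):
        scores = influence_scores(X[remaining], y[remaining], x_test)
        j = int(np.argmax(scores))
        chosen.append(int(remaining[j]))
        remaining = np.delete(remaining, j)
    return chosen

def actual_effect(X, y, x_test, subset):
    """Exact change in the target after refitting without `subset`."""
    mask = np.ones(len(y), dtype=bool)
    mask[list(subset)] = False
    return float(x_test @ (fit_ols(X[mask], y[mask]) - fit_ols(X, y)))
```

Comparing `actual_effect` on the two selected subsets shows whether rescoring after each removal changed the outcome on a given dataset. The adaptive variant inverts a new Gram matrix at every step, which is the performance versus efficiency trade-off the abstract mentions.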

Paper Structure (54 sections, 10 theorems, 94 equations, 6 figures, 1 table)

Introduction
Contributions.
Concurrent work.
Preliminaries
Problem statement
Influence-based greedy heuristics
Pitfalls of greedy heuristics in Most Influential Subset Selection
Setup and notation.
Influence function is not accurate (even) in linear models
Violation of the additivity assumption: amplification and cancellation
Amplification.
Cancellation.
Promises of the adaptive greedy algorithm
Experiments
Evaluation metrics.
...and 39 more sections

Key Result

Theorem 3.1

Assume $h_{11} > h_{nn}$. Under the label generation process of the paper (referenced there as eq:label_gen), there exists some $p$ such that ZAMinfluence fails to select the most influential sample.
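The role of the leverage scores $h_{ii}$ in this result can be illustrated with a standard identity for ordinary least squares: the exact leave-one-out change equals the first-order influence estimate divided by $1 - h_{ii}$, so high-leverage samples are under-estimated more strongly than low-leverage ones. The sketch below checks this numerically. It assumes the target function is the prediction at a hypothetical test point `x_test`, which may differ from the paper's setup, and the function name `loo_vs_influence` is invented for this example.

```python
import numpy as np

def loo_vs_influence(X, y, x_test):
    """Compare first-order influence estimates with exact leave-one-out
    effects on the prediction x_test @ beta in OLS."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # Leverage scores h_ii = x_i^T (X^T X)^{-1} x_i
    leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    # First-order estimate of the change when sample i is removed
    influence = -(X @ XtX_inv @ x_test) * resid
    # Exact change: the estimate is inflated by 1 / (1 - h_ii)
    actual = influence / (1.0 - leverage)
    return influence, actual, leverage
```

Because the correction factor $1/(1 - h_{ii})$ differs across samples, ranking by the influence estimate and ranking by the actual effect need not agree, which is the kind of failure the theorem describes.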

Figures (6)

Figure 1: Influence estimates suffer from disparate levels of under-estimation, leading to the failure of $1$-MISS
Figure 2: LAGS fails in $2$-MISS due to amplification
Figure 3: LAGS fails in $2$-MISS due to cancellation
Figure 4: Adaptive Greedy vs. Greedy Algorithm. Row 1: the averaged actual effect $\overline{A_{-S}}$ induced by the greedy and adaptive greedy algorithms. Row 2: the winning rate, i.e., the proportion of instances where one algorithm outperforms the other.
Figure 5: Adaptive Greedy vs. Greedy Algorithm. Left: the averaged actual effect $\overline{A_{-S}}$ induced by the greedy and adaptive greedy algorithms. Right: the winning rate, i.e., the proportion of instances where one algorithm outperforms the other.
...and 1 more figure

Theorems & Definitions (27)

Definition 2.1: Most Influential Subset Selection (MISS)
Definition 2.2: Upweighted objective
Definition 2.3: Influence function of a set
Theorem 3.1
Proposition 3.2
Remark 3.3
Proposition 3.4
Theorem 3.5
Theorem 3.6
Proposition 4.1
...and 17 more
