Relabeling Minimal Training Subset to Flip a Prediction

Jinghan Yang; Linjie Xu; Lequan Yu

TL;DR

This work addresses the problem of identifying the smallest training subset $\mathcal{S}_t$ whose relabeling would flip the prediction for a test point $x_t$, enabling contestability and debugging of predictions. It introduces IP-relabel, an approach based on an extended influence function for binary classification with convex loss, and presents a computationally efficient algorithm with complexity $O(p^3+Np^2)$. The key findings show that $|\mathcal{S}_t|$ can be small, with $|\mathcal{S}_t|<0.02N$, and that $|\mathcal{S}_t|$ correlates with training noise and provides information beyond predicted probabilities; the composition of $\mathcal{S}_t$ can also reveal group-attribution bias. The contributions offer a practical robustness metric, a data-centric tool for bias detection, and pathways for data cleaning and fairness improvements in real-world settings, with extensions to more complex models such as LLMs and to multi-class tasks left open. Throughout, $\mathcal{S}_t$ denotes the relabeled subset, $x_t$ the test point, and $\tau$ the threshold that governs flip decisions.
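
The objective described above can be written compactly. The following is one possible formalization, not a formula quoted from the paper: $\mathcal{D}$ (the training set), $f_{\theta}(x_t)$ (the predicted probability), $\hat{\theta}$ (parameters fitted on the original labels), and $\hat{\theta}_{\mathcal{S}}$ (parameters fitted after flipping the labels of the points in $\mathcal{S}$) are notation assumed here for illustration.

$$\mathcal{S}_t \in \arg\min_{\mathcal{S} \subseteq \mathcal{D}} |\mathcal{S}| \quad \text{subject to} \quad \mathbb{1}\big[f_{\hat{\theta}_{\mathcal{S}}}(x_t) > \tau\big] \neq \mathbb{1}\big[f_{\hat{\theta}}(x_t) > \tau\big]$$

Read this way, $|\mathcal{S}_t|$ is the number of label changes needed to push the prediction for $x_t$ across the threshold $\tau$.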

Abstract

When facing an unsatisfactory prediction from a machine learning model, users can be interested in investigating the underlying reasons and exploring the potential for reversing the outcome. We ask: to flip the prediction on a test point $x_t$, how can we identify the smallest training subset $\mathcal{S}_t$ that we need to relabel? We propose an efficient algorithm to identify and relabel such a subset via an extended influence function for binary classification models with convex loss. We find that relabeling fewer than 2% of the training points can always flip a prediction. This mechanism can serve multiple purposes: (1) providing an approach to challenge a model prediction by altering training points; (2) evaluating model robustness with the cardinality of the subset (i.e., $|\mathcal{S}_t|$), where we show that $|\mathcal{S}_t|$ is highly related to the noise ratio in the training set and is correlated with, but complementary to, predicted probabilities; and (3) revealing training points that lead to group attribution bias. To the best of our knowledge, we are the first to investigate identifying and relabeling the minimal training subset required to flip a given prediction.
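
To make the idea concrete, here is a minimal sketch of influence-guided relabeling for L2-regularized logistic regression. It is not the authors' implementation: the function names, the greedy selection rule, the regularization strength `lam`, and the use of Newton's method are all assumptions made for illustration. The sketch estimates, to first order, how flipping each training label would shift the predicted probability for $x_t$, picks the most helpful points until the estimated probability crosses $\tau$, and then retrains to check whether the flip actually happens. Building the Hessian costs $O(Np^2)$ and solving against it costs $O(p^3)$, which is consistent with the stated complexity if $p$ is read as the number of parameters (also an assumption).

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logreg(X, y, lam=1e-2, iters=50):
    """Fit L2-regularized logistic regression with Newton's method."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(iters):
        prob = sigmoid(X @ theta)
        grad = X.T @ (prob - y) / n + lam * theta
        hess = (X.T * (prob * (1.0 - prob))) @ X / n + lam * np.eye(p)
        theta = theta - np.linalg.solve(hess, grad)
    return theta


def find_relabel_subset(X, y, x_t, lam=1e-2, tau=0.5):
    """Hypothetical helper: greedy, first-order search for labels to flip.

    Returns (indices, flipped) where `flipped` reports whether retraining
    on the relabeled data really moved the prediction across tau, or
    None if the first-order estimate never reaches the threshold.
    """
    n, p = X.shape
    theta = fit_logreg(X, y, lam)
    prob = sigmoid(X @ theta)
    p_t = sigmoid(x_t @ theta)

    # Hessian of the regularized mean loss: O(N p^2) to build.
    hess = (X.T * (prob * (1.0 - prob))) @ X / n + lam * np.eye(p)
    h_inv_xt = np.linalg.solve(hess, x_t)  # O(p^3)

    # Flipping y_i changes point i's gradient by (2*y_i - 1) * x_i, so the
    # first-order parameter shift is -(1/n) * H^{-1} (2*y_i - 1) * x_i.
    # Chain rule through the sigmoid gives the shift in p_t.
    delta_p = -p_t * (1.0 - p_t) * (2.0 * y - 1.0) * (X @ h_inv_xt) / n

    # Orient the scores so that "positive" means "moves p_t toward tau".
    sign = -1.0 if p_t > tau else 1.0
    score = sign * delta_p
    order = np.argsort(-score)
    cum = np.cumsum(score[order])

    hits = np.flatnonzero(cum > abs(p_t - tau))
    if hits.size == 0:
        return None
    subset = order[: hits[0] + 1]

    # Verify by relabeling and retraining from scratch.
    y_new = y.copy()
    y_new[subset] = 1.0 - y_new[subset]
    p_new = sigmoid(x_t @ fit_logreg(X, y_new, lam))
    flipped = bool((p_new > tau) != (p_t > tau))
    return subset, flipped
```

Because the selection relies on a linear approximation, the retraining step at the end matters: the size of the returned subset is only a candidate for $|\mathcal{S}_t|$ until the `flipped` flag confirms it, and a greedy choice like this one gives no guarantee of minimality.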

Paper Structure (19 sections, 4 equations, 10 figures, 8 tables, 1 algorithm)

Introduction
Methods
Algorithm
Case Study
Experiments
Experimental Setting
Algorithm Validation
$|\mathcal{S}_t|$ Quantifies Model Robustness
Composition of $\mathcal{S}_t$ Contributes Bias Explanation
Comparison between Removal and Relabeling
Related Work
Discussion and Future Work
Conclusions
Limitations and Risks
Appendix
...and 4 more sections

Figures (10)

Figure 1: The question we seek to answer is: which is the smallest subset of the training data that needs to be relabeled in order to flip a specific prediction from the model?
Figure 2: The average absolute difference in predicted probabilities for sampled test points after relabeling $k=|\mathcal{S}_t|$ training points, using different methods on the movie review dataset.
Figure 3: The histogram shows the distribution of $k=|\mathcal{S}_t|$ on the hate speech dataset, i.e., the minimal number of training points that need to be relabeled to change the prediction $\hat{y}_t$ for a specific test example $x_t$.
Figure 4: Comparison of the average $k = |\mathcal{S}_t|$ values for shared test points under both BERT and LR models that were successfully flipped by our method.
Figure 5: The correlation between the predicted probabilities of certain test examples and $k=|\mathcal{S}_t|$ on the hate speech dataset. For test examples where the model is highly certain about its prediction, the prediction can be flipped by relabeling a small number of data points from the training set.
...and 5 more figures
