Relabeling Minimal Training Subset to Flip a Prediction

Jinghan Yang, Linjie Xu, Lequan Yu

TL;DR

This work addresses the problem of identifying the smallest training subset $\mathcal{S}_t$ whose relabeling would flip the prediction for a test point $x_t$, enabling contestability and debugging of predictions. It introduces IP-relabel, an approach based on an extended influence function for binary classification with convex loss, and presents a computationally efficient algorithm with complexity $O(p^3+Np^2)$. The key findings show that $|\mathcal{S}_t|$ can be small, with $|\mathcal{S}_t|<0.02N$, and that $|\mathcal{S}_t|$ correlates with training noise and provides information beyond predicted probabilities, including revealing group-attribution bias. The contributions offer a practical robustness metric, a data-centric tool for bias detection, and pathways for data cleaning and fairness improvements in real-world settings, with extensions open to more complex models like LLMs and multi-class tasks. $|\mathcal{S}_t|$ and $x_t$ appear throughout, and the threshold $\tau$ governs flip decisions.

Abstract

When facing an unsatisfactory prediction from a machine learning model, users may want to investigate the underlying reasons and explore the potential for reversing the outcome. We ask: to flip the prediction on a test point $x_t$, how can we identify the smallest training subset $\mathcal{S}_t$ that we need to relabel? We propose an efficient algorithm to identify and relabel such a subset via an extended influence function for binary classification models with convex loss. We find that relabeling fewer than 2% of the training points can always flip a prediction. This mechanism can serve multiple purposes: (1) providing an approach to challenge a model prediction by altering training points; (2) evaluating model robustness with the cardinality of the subset (i.e., $|\mathcal{S}_t|$): we show that $|\mathcal{S}_t|$ is highly related to the noise ratio in the training set and that $|\mathcal{S}_t|$ is correlated with but complementary to predicted probabilities; and (3) revealing training points that lead to group attribution bias. To the best of our knowledge, we are the first to investigate identifying and relabeling the minimal training subset required to flip a given prediction.
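
To make the idea concrete, the following is a minimal sketch of influence-based relabeling for L2-regularized logistic regression. It is not the authors' IP-relabel implementation: the function names, the synthetic data, the regularization strength `lam`, the decision threshold of 0.5, and the greedy selection rule are all assumptions made for illustration. Taking $p$ to be the feature dimension (also an assumption), building the Hessian costs $O(Np^2)$ and solving with it costs $O(p^3)$, which is in line with the stated complexity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, lam=0.1, iters=25):
    """Fit L2-regularized logistic regression (labels in {0, 1}) by Newton's method."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(iters):
        s = sigmoid(X @ theta)
        grad = X.T @ (s - y) / n + lam * theta
        hess = (X.T * (s * (1.0 - s))) @ X / n + lam * np.eye(p)
        theta = theta - np.linalg.solve(hess, grad)
    return theta

def relabel_influence(X, y, theta, x_t, lam=0.1):
    """First-order estimate of the change in the test logit when each
    training label y_i is flipped to 1 - y_i (one value per training point)."""
    n, p = X.shape
    s = sigmoid(X @ theta)
    hess = (X.T * (s * (1.0 - s))) @ X / n + lam * np.eye(p)  # O(N p^2)
    h_inv_xt = np.linalg.solve(hess, x_t)                      # O(p^3)
    # Flipping y_i changes the loss gradient by (2*y_i - 1) * x_i, so
    # delta_theta_i ~= -(1/n) * H^{-1} * (2*y_i - 1) * x_i.
    return -(2.0 * y - 1.0) * (X @ h_inv_xt) / n

def smallest_flip_subset(logit_t, delta):
    """Greedily take the most helpful relabelings until the estimated logit
    crosses 0 (probability 0.5). Returns an index array, or None."""
    direction = -np.sign(logit_t)      # direction the logit has to move
    gain = direction * delta           # positive values help the flip
    order = np.argsort(-gain)
    order = order[gain[order] > 0]
    cum = np.cumsum(gain[order])
    hits = np.nonzero(cum >= abs(logit_t))[0]
    return None if hits.size == 0 else order[: hits[0] + 1]

# Synthetic data (hypothetical; stands in for a real training set).
rng = np.random.default_rng(0)
N, p = 500, 5
X = rng.normal(size=(N, p))
w_true = rng.normal(size=p)
y = (rng.random(N) < sigmoid(X @ w_true)).astype(float)  # noisy labels
x_t = rng.normal(size=p)

theta = fit_logreg(X, y)
logit_t = float(x_t @ theta)
delta = relabel_influence(X, y, theta, x_t)
subset = smallest_flip_subset(logit_t, delta)

if subset is None:
    print("estimate: no subset flips this prediction")
else:
    y_new = y.copy()
    y_new[subset] = 1.0 - y_new[subset]
    # The influence estimate is only approximate, so retrain to check.
    flipped = (x_t @ fit_logreg(X, y_new)) * logit_t < 0
    print(f"|S_t| = {len(subset)} of {N}; flip confirmed by retraining: {flipped}")
```

Because the influence estimate is a first-order approximation, the sketch retrains on the relabeled data to check whether the flip actually occurs. The greedy rule yields a small subset under the linear approximation, but nothing in this sketch guarantees it is the true minimum.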

Paper Structure

This paper contains 19 sections, 4 equations, 10 figures, 8 tables, 1 algorithm.

Figures (10)

  • Figure 1: The question we seek to answer is: which is the smallest subset of the training data that needs to be relabeled in order to flip a specific prediction from the model?
  • Figure 2: The average absolute change in predicted probability for sampled test points that results from relabeling $k=|\mathcal{S}_t|$ training points, using different methods on the movie review dataset.
  • Figure 3: The histogram shows the distribution of $k=|\mathcal{S}_t|$ on the hate speech dataset, i.e. the minimal number of points that need to be relabeled from the training data to change the prediction $\hat{y}_t$ of a specific test example $x_t$.
  • Figure 4: Comparison of the average $k = |\mathcal{S}_t|$ values for shared test points under both BERT and LR models that were successfully flipped by our method.
  • Figure 5: The correlation between the predicted probabilities of certain test examples and $k=|\mathcal{S}_t|$ on the hate speech dataset. For test examples where the model is highly certain about its prediction, the prediction can be flipped by relabeling a small number of data points from the training set.
  • ...and 5 more figures