What Is The Performance Ceiling of My Classifier? Utilizing Category-Wise Influence Functions for Pareto Frontier Analysis

Shahriar Kabir Nahin, Wenxiao Xiao, Joshua Liu, Anshuman Chhabra, Hongfu Liu

TL;DR

The paper tackles the problem of identifying a classifier's performance ceiling from a category-aware perspective, beyond overall accuracy. It introduces category-wise influence functions and an influence vector $P(z) \in \mathbb{R}^K$, enabling Pareto frontier analysis across $K$ classes. A Pareto-LP-GA framework then reweights training samples via a linear program guided by $P(z)$ to achieve Pareto improvements, with modes for Direct Improvement and Course Correction. The authors validate the approach on synthetic data and real benchmarks (CIFAR-10, STL-10, Emotion, AG_News), showing substantial per-class gains with limited degradation in other classes, thereby providing a practical data-centric tool for per-class optimization and fairer performance tradeoffs.

Abstract

Data-centric learning seeks to improve model performance from the perspective of data quality, and has been drawing increasing attention in the machine learning community. Among its key tools, influence functions provide a powerful framework to quantify the impact of individual training samples on model predictions, enabling practitioners to identify detrimental samples and retrain models on a cleaner dataset for improved performance. However, most existing work focuses on the question: "what data benefits the learning model?" In this paper, we take a step further and investigate a more fundamental question: "what is the performance ceiling of the learning model?" Unlike prior studies that primarily measure improvement through overall accuracy, we emphasize category-wise accuracy and aim for Pareto improvements, ensuring that every class benefits, rather than allowing tradeoffs where some classes improve at the expense of others. To address this challenge, we propose category-wise influence functions and introduce an influence vector that quantifies the impact of each training sample across all categories. Leveraging these influence vectors, we develop a principled criterion to determine whether a model can still be improved, and further design a linear programming-based sample reweighting framework to achieve Pareto performance improvements. Through extensive experiments on synthetic datasets, vision, and text benchmarks, we demonstrate the effectiveness of our approach in estimating and achieving a model's performance improvement across multiple categories of interest.
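The influence-vector idea can be illustrated with a small sketch. It is not the authors' implementation: the sign convention (a positive entry means the sample helps that class, a negative entry means it hurts), the function names, and the tolerance `tol` are all assumptions made for illustration. The influence vectors are taken as given rather than computed from a model. The sketch sorts training samples by the signs of their influence vectors, so that samples hurting at least one class while helping none stand out as candidates whose removal would, to first order, be a Pareto improvement.

```python
from typing import Dict, List, Sequence


def classify_sample(influence: Sequence[float], tol: float = 0.0) -> str:
    """Label one training sample by the signs of its influence vector.

    Assumed convention (not taken from the paper): entry k > tol means the
    sample helps class k, entry k < -tol means it hurts class k.
    """
    helps = any(v > tol for v in influence)
    hurts = any(v < -tol for v in influence)
    if hurts and not helps:
        return "detrimental"  # hurts at least one class, helps none
    if helps and not hurts:
        return "beneficial"   # helps at least one class, hurts none
    if helps and hurts:
        return "tradeoff"     # helps some classes at the expense of others
    return "neutral"          # no measurable effect on any class


def partition(vectors: Sequence[Sequence[float]], tol: float = 0.0) -> Dict[str, List[int]]:
    """Group sample indices by the label returned from classify_sample."""
    groups: Dict[str, List[int]] = {
        "beneficial": [], "detrimental": [], "tradeoff": [], "neutral": [],
    }
    for index, vec in enumerate(vectors):
        groups[classify_sample(vec, tol)].append(index)
    return groups


def has_obvious_headroom(vectors: Sequence[Sequence[float]], tol: float = 0.0) -> bool:
    """Sufficient (not necessary) sign test for a possible Pareto improvement.

    True when at least one sample is detrimental under the assumed convention.
    """
    return bool(partition(vectors, tol)["detrimental"])


# Hypothetical influence vectors for five samples and K = 2 classes.
example_vectors = [
    (0.4, 0.1),    # helps both classes
    (-0.3, -0.2),  # hurts both classes
    (0.5, -0.6),   # helps class 0, hurts class 1
    (0.0, 0.0),    # no effect
    (-0.1, 0.0),   # hurts class 0, no effect on class 1
]
example_groups = partition(example_vectors)
```

This sign test is only a coarse check. Several tradeoff samples can jointly admit a Pareto improvement even when none is individually detrimental, which is the kind of case a reweighting formulation such as the linear program described above is meant to cover.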

Paper Structure

This paper contains 21 sections, 1 equation, 8 figures, 2 tables, 1 algorithm.

Figures (8)

  • Figure 1: Influence space for 2 categories.
  • Figure 2: Validation of our category-wise influence function methods for analyzing the Pareto frontier on two synthetic binary classification datasets with logistic regression. Subfigures A-C showcase results on a synthetic dataset that is linearly separable and contains noisy detrimental training samples, where performance can improve by mislabeled sample removal. Subfigures D-F detail results for our method on a non-linearly separable dataset without any noisy samples, where performance improvements cannot be made for either class without sacrificing performance for the other. Subfigures A and D showcase the distribution of training samples for each of the two datasets with blue and orange denoting the ground-truth class labels. Subfigures B and E showcase the category-wise influence score distribution for both datasets. Further, subfigures C and F map the influence values to the training samples using color intensity in accordance with class colors to denote the influence magnitudes, where the original class color means positive and red color means negative.
  • Figure 3: Real-world data experiments on the CIFAR-10 (Krizhevsky, 2009) image dataset.
  • Figure 4: Real-world data experiments on the Emotion (Saravia et al., 2018) text dataset.
  • Figure 5: Experiment demonstrating the use of the category-wise influence function for dataset augmentation. The dataset used in this demonstration is identical to that in the top row of Figure 1. Each row displays one iteration of the dataset trimming procedure discussed in Appendix A: the state of the dataset, the changes that our method will make, and the result. Subfigures A, D, and G display the state of the training dataset. The color of each point indicates its label. The linear decision boundary is drawn, and its accuracy on each class is shown in the legend. Subfigures B, E, and H show the category-wise influence score of each training point. Training points shown in green are those the score indicates are detrimental to model performance on both classes. These points will be removed by the improvement procedure. Subfigures C, F, and I show the training dataset after removal. The resulting decision boundary and its accuracy are also indicated. Note how the noise furthest from the decision boundary is removed first, since these points have the largest effect on the decision boundary. Additionally, at each iteration, the accuracy of the linear model improves when it is trained on the trimmed dataset. After three iterations, all noise has been removed and the model is at the performance ceiling.
  • ...and 3 more figures