How to Achieve Higher Accuracy with Less Training Points?

Jinghan Yang, Anupam Pani, Yunchao Zhang

TL;DR

This work tackles the data-efficiency problem in large-scale model training by leveraging influence functions to identify informative training points that can be added to a smaller subset without sacrificing predictive performance. Building on the IP-adding framework, the authors propose six training-point selection methods to optimize the impact of additional data on test predictions, and validate them using a logistic regression model for binary sentiment analysis with Bag-of-Words features. Key findings show that selective data inclusion can achieve comparable accuracy to full-data training with only about $10\%$ of the added data, and can even improve accuracy when using around $60\%$ of the data, with Method 1 and Method 3 often performing best. The approach offers a practical path to Green AI by reducing computational cost and data requirements, and points to further work in extending the method to neural networks, multi-class tasks, distribution shifts, and theoretical generalization analyses.

Abstract

In the era of large-scale model training, the extensive use of available datasets has resulted in significant computational inefficiencies. To tackle this issue, we explore methods for identifying informative subsets of training data that can achieve comparable or even superior model performance. We propose a technique based on influence functions to determine which training samples should be included in the training set. We conducted empirical evaluations of our method on binary classification tasks utilizing logistic regression models. Our approach demonstrates performance comparable to that of training on the entire dataset while using only 10% of the data. Furthermore, we found that our method achieved even higher accuracy when trained with just 60% of the data.
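The summary above does not spell out the six selection rules, so the following is only a minimal sketch of the general idea it describes: score candidate training points with influence functions under a logistic regression model, then add the points predicted to help most. It assumes the standard first-order influence approximation, in which the effect of adding a candidate $z$ on the test loss is proportional to $-\nabla L(z_{\text{test}})^\top H^{-1} \nabla L(z)$. All function names, the toy data, and the L2 term (added here only to keep the Hessian invertible) are illustrative assumptions, not the authors' implementation.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def grad(w, x, y):
    """Gradient of the logistic loss at one example (label y in {0, 1})."""
    r = sigmoid(dot(w, x)) - y
    return [r * xi for xi in x]

def mean_grad(w, X, y):
    """Average logistic-loss gradient over a set of examples."""
    g = [0.0] * len(w)
    for x, t in zip(X, y):
        for i, gi in enumerate(grad(w, x, t)):
            g[i] += gi / len(X)
    return g

def hessian(w, X, l2):
    """Mean logistic-loss Hessian over X plus an L2 term (assumed, for invertibility)."""
    d = len(w)
    H = [[l2 if i == j else 0.0 for j in range(d)] for i in range(d)]
    for x in X:
        p = sigmoid(dot(w, x))
        s = p * (1.0 - p) / len(X)
        for i in range(d):
            for j in range(d):
                H[i][j] += s * x[i] * x[j]
    return H

def fit(X, y, l2=0.1, steps=25):
    """Newton's method for L2-regularised logistic regression."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        g = [gi + l2 * wi for gi, wi in zip(mean_grad(w, X, y), w)]
        step = solve(hessian(w, X, l2), g)
        w = [wi - si for wi, si in zip(w, step)]
    return w

def influence_scores(w, X_base, X_cand, y_cand, X_test, y_test, l2=0.1):
    """Score each candidate by -g_test^T H^{-1} g_cand (more negative = more helpful)."""
    s_test = solve(hessian(w, X_base, l2), mean_grad(w, X_test, y_test))
    return [-dot(s_test, grad(w, x, t)) for x, t in zip(X_cand, y_cand)]

def select(scores, k):
    """Indices of the k candidates predicted to reduce test loss the most."""
    return sorted(range(len(scores)), key=lambda i: scores[i])[:k]

# Toy data: first feature is a constant bias term, second is a single input.
X_base = [[1.0, 0.5], [1.0, -1.0], [1.0, 2.0], [1.0, -0.3]]
y_base = [1, 0, 1, 0]
X_cand = [[1.0, 1.5], [1.0, 1.5], [1.0, -2.0]]
y_cand = [1, 0, 0]
X_test, y_test = [[1.0, 1.5]], [1]

w = fit(X_base, y_base)
scores = influence_scores(w, X_base, X_cand, y_cand, X_test, y_test)
order = select(scores, len(scores))
print(scores, order)
```

In this toy setup the first candidate is identical to the test point, so its score is negative (adding it is predicted to lower the test loss), while the second candidate has the same features with the opposite label and receives a positive score. A selection rule in this spirit would add the lowest-scoring candidates first and retrain.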

Paper Structure

This paper contains 15 sections, 6 equations, 1 table, 1 algorithm.