In2Core: Leveraging Influence Functions for Coreset Selection in Instruction Finetuning of Large Language Models

Ayrton San Joaquin, Bin Wang, Zhengyuan Liu, Nicholas Asher, Brian Lim, Philippe Muller, Nancy F. Chen

TL;DR

The In2Core algorithm selects a coreset by analyzing the correlation between training and evaluation samples under a trained model. The authors also propose an optimization that computes influence functions over a reduced number of layers while achieving similar accuracy.

Abstract

Despite advancements, fine-tuning Large Language Models (LLMs) remains costly due to the extensive parameter count and substantial data requirements for model generalization. Access to computing resources remains a barrier for the open-source community. To address this challenge, we propose the In2Core algorithm, which selects a coreset by analyzing the correlation between training and evaluation samples with a trained model. Notably, we assess the model's internal gradients to estimate this relationship, aiming to rank the contribution of each training point. To enhance efficiency, we propose an optimization to compute influence functions with a reduced number of layers while achieving similar accuracy. By applying our algorithm to instruction fine-tuning data of LLMs, we can achieve similar performance with just 50% of the training data. Meanwhile, using influence functions to analyze particular test samples could provide a reliable and interpretable signal of the training set's coverage of those test points.

Paper Structure (19 sections, 3 equations, 7 figures, 1 table)

This paper contains 19 sections, 3 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Memory efficiency for different numbers of LoRA layers considered when calculating influence values. The numbers above each bar correspond to the virtual CPU memory consumed (in GB). The memory efficiency rate is not linear and varies across models. One should profile a sample subset to find the best combination of CPU virtual memory and number of layers given their hardware constraints.
  • Figure 2: Overview of In2Core for coreset selection. From left to right, we first calculate the influence value of each training point using the validation dataset and a reference model fine-tuned on the full dataset with LoRA. Then, we rank the training points by influence value and select the $h$ highest-scoring proponents as the final training data, where $h$ is a hyperparameter. Finally, we train a base model on this final training data. The reference and base models may have different architectures.
  • Figure 3: Mean perplexity of models fine-tuned on coresets chosen by different selection strategies, measured on the 250-point Ultrachat Evaluation Set. The strategies differ in how they select points based on influence values. 'Full' denotes a model trained on the full training data. Proponents, the default strategy in practice, outperforms all groups except Full in all cases. Interestingly, some strategies result in a worse model than Random. In Section \ref{disc_coreset}, we argue that some points inherently degrade model training when included (e.g. Minimum and Opponents).
  • Figure 4: Histogram of influence values of the 50k training set using Gemma-2B as the reference model. The distribution is left-skewed, with the majority of influence values being negative. The Opponent and Minimum groups (points with positive values and points with the smallest influence magnitudes, respectively) are located on the right side of the histogram and overlap heavily (overlap coefficient = 0.59). Note that the numbers in blue denote the rounded values of the extremes of the distribution.
  • Figure 5: Relationship between Perplexity and Measures of Importance of the Training Set on the Test Set (N=250). Influence values provide a better signal (correlation coefficient = 0.56) to indicate how well a model generalizes to a particular test point compared to semantic similarity (correlation coefficient = -0.087).
  • ...and 2 more figures
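
The ranking step in Figure 2 and the group overlap in Figure 4 can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes influence values are already computed, one per training point, and it assumes the sign convention implied by the Figure 4 caption (Opponents have positive values, so Proponents are the most negative). The function names `select_coreset` and `overlap_coefficient` are hypothetical, and the overlap measure is assumed to be the set-based overlap coefficient, |A ∩ B| / min(|A|, |B|), which may differ from the exact quantity the paper reports.

```python
import random


def select_coreset(influences, h, strategy="proponents", seed=0):
    """Return indices of h training points chosen by a selection strategy.

    Assumed sign convention (inferred from the Figure 4 caption): Opponents
    have positive influence values, so Proponents are the most negative ones.
    """
    if not 0 <= h <= len(influences):
        raise ValueError("h must be between 0 and the number of points")
    idx = list(range(len(influences)))
    if strategy == "proponents":
        # Most negative influence first.
        idx.sort(key=lambda i: influences[i])
    elif strategy == "opponents":
        # Most positive influence first (may include negatives if h is large).
        idx.sort(key=lambda i: -influences[i])
    elif strategy == "minimum":
        # Smallest influence magnitude first.
        idx.sort(key=lambda i: abs(influences[i]))
    elif strategy == "random":
        random.Random(seed).shuffle(idx)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return idx[:h]


def overlap_coefficient(a, b):
    """Set overlap coefficient: |A ∩ B| / min(|A|, |B|)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


if __name__ == "__main__":
    # Toy influence values, made up for illustration only.
    influences = [-3.0, 0.5, -0.1, 2.0, -1.5, 0.05]
    print(select_coreset(influences, h=2))
    opponents = select_coreset(influences, 3, "opponents")
    minimum = select_coreset(influences, 3, "minimum")
    print(overlap_coefficient(opponents, minimum))
```

The selected indices would then be used to subset the training data before fine-tuning the base model; here $h$ is the same hyperparameter named in the Figure 2 caption.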