Automatic Pruning of Fine-tuning Datasets for Transformer-based Language Models

Mohammadreza Tayaranian, Seyyed Hasan Mozafari, Brett H. Meyer, James J. Clark, Warren J. Gross

TL;DR

This work addresses the computational burden of fine-tuning large transformer-based language models by automatically pruning the downstream training data. It introduces the $\mathcal{H}$-score, which quantifies per-example learning difficulty across multiple fine-tuning runs and epochs, enabling adaptive subset creation that includes the winning ticket subset. Empirical results across five NLP tasks and two models show that pruning to roughly one-third of the data often maintains or even improves evaluation accuracy, with notable gains on SST-2; the method outperforms forgetting-based and ambiguous baselines and supports faster neural architecture search. The proposed approach offers a principled, model-task-specific mechanism to reduce fine-tuning cost while preserving performance, with implications for efficiency in NAS and large-scale experimentation.

Abstract

Transformer-based language models have shown state-of-the-art performance on a variety of natural language understanding tasks. To achieve this performance, these models are first pre-trained on a general corpus and then fine-tuned on downstream tasks. Previous work studied how pruning the training set of a downstream task affects the model's performance on the task's evaluation set. In this work, we propose an automatic dataset pruning method for the training set of fine-tuning tasks. Our method is based on the model's success rate in correctly classifying each training data point. Unlike previous work, which relies on user feedback to determine subset size, our method automatically extracts training subsets that are adapted to each pair of model and fine-tuning task. Our method provides multiple subsets for use in dataset pruning that navigate the trade-off between subset size and evaluation accuracy. Our largest subset, which we also refer to as the winning ticket subset, is on average $3 \times$ smaller than the original training set of the fine-tuning task. Our experiments on 5 downstream tasks and 2 language models show that, on average, fine-tuning on the winning ticket subsets results in a $0.1 \%$ increase in the evaluation performance of the model.
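
To make the idea of a per-example success rate concrete, the following is a minimal sketch rather than the paper's implementation. It assumes a score can be formed by counting how often each training example is classified correctly across several fine-tuning runs and epochs, and that a subset is then selected by keeping examples whose score falls in a chosen band. The function names `success_counts` and `select_subset`, the array layout, and the thresholds `low` and `high` are hypothetical; the exact definition of the $\mathcal{H}$-score and of the subset rule may differ.

```python
import numpy as np


def success_counts(correct: np.ndarray) -> np.ndarray:
    """Count correct classifications per training example.

    `correct` is a boolean array of shape (runs, epochs, examples), where
    correct[r, e, i] is True if example i was classified correctly at
    epoch e of fine-tuning run r. This layout is an assumption made for
    illustration only.
    """
    if correct.ndim != 3:
        raise ValueError("expected an array of shape (runs, epochs, examples)")
    # Sum over the run and epoch axes, leaving one count per example.
    return correct.astype(int).sum(axis=(0, 1))


def select_subset(scores: np.ndarray, low: int, high: int) -> np.ndarray:
    """Return indices of examples whose score lies in [low, high].

    The thresholds are hypothetical knobs, not values from the paper.
    """
    return np.flatnonzero((scores >= low) & (scores <= high))


# Toy record: 2 runs, 2 epochs, 4 training examples.
correct = np.array(
    [
        [[1, 0, 1, 0], [1, 0, 1, 1]],  # run 0, epochs 0 and 1
        [[1, 0, 0, 0], [1, 0, 1, 1]],  # run 1, epochs 0 and 1
    ],
    dtype=bool,
)

scores = success_counts(correct)
kept = select_subset(scores, low=1, high=3)
print(scores.tolist(), kept.tolist())
```

In this toy record, the first example is always classified correctly and the second never is, so a band that excludes both extremes keeps only the last two examples. Which end of the score range should be pruned, and by how much, is precisely the choice the proposed method aims to automate.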

Paper Structure

This paper contains 21 sections, 2 equations, 17 figures, 10 tables.

Figures (17)

  • Figure 1: Results of fine-tuning two transformer-based language models, OPT350M (Zhang et al., 2022) and RoBERTaLARGE (Liu et al., 2019), on the winning ticket subset of various downstream tasks. For each task, the size of the winning ticket subset as a percentage of the full dataset is shown as gray bars. $\Delta$ Metric Performance, shown using red dots, is the change in evaluation performance when the model is fine-tuned on the winning ticket subset instead of the full dataset. A positive $\Delta$ indicates that the winning ticket subset improved the evaluation performance of the model.
  • Figure 2: Distribution of the $\mathcal{H}$-score for the training sets of the MNLI and SQuAD v2 tasks, calculated using two transformer-based models, RoBERTaLARGE and OPT350M.
  • Figure 3: Subset size and evaluation accuracy of different subsets of the MNLI training set based on OPT350M. Each dot represents a subset. Subsets are created using Equation \ref{eq:subset} with all possible values of $M$. Our proposed subsets are marked with vertical dotted lines.
  • Figure 4: Evaluation accuracy of fine-tuning RoBERTaLARGE on different subsets of the training dataset of multiple downstream tasks. All plots share the same legend. In \ref{fig:result_race_roberta}, the ambiguous setup falls outside the range of the vertical axis. Detailed accuracies and subset sizes are provided in Appendix \ref{sec:app:results}.
  • Figure 5: Evaluation accuracy of fine-tuning OPT350M on different subsets of the training dataset of multiple downstream tasks. All plots share the same legend. Detailed accuracies and subset sizes are provided in Appendix \ref{sec:app:results}.
  • ...and 12 more figures