Swift Cross-Dataset Pruning: Enhancing Fine-Tuning Efficiency in Natural Language Understanding

Binh-Nguyen Nguyen, Yang He

TL;DR

Swift Cross-Dataset Pruning (SCDP) addresses data efficiency in cross-dataset NLP fine-tuning by introducing a fast Frequency Distance ($\mathrm{FD}$) score, which measures how far each sample's TF-IDF vector lies from an $\epsilon$-approximation of the geometric median, $\mathbf{g}_\epsilon$. The approach couples this score with dataset size-adaptive pruning to preserve diversity in both small and large corpora, producing compact coresets of $(1-r)|S|$ samples for a dataset $S$ pruned at rate $r$ that retain predictive performance. Across six NLP tasks, SCDP matches or exceeds baselines at pruning rates up to 70% while substantially reducing computation time compared to methods that require full-dataset training or reference models. The results suggest a practical, task-agnostic path to data-efficient fine-tuning in diverse NLP settings.

Abstract

Dataset pruning aims to select a subset of a dataset for efficient model training. While data efficiency in natural language processing has primarily focused on within-corpus scenarios during model pre-training, efficient dataset pruning for task-specific fine-tuning across diverse datasets remains challenging due to variability in dataset sizes, data distributions, class imbalance, and label spaces. Current cross-dataset pruning techniques for fine-tuning often rely on computationally expensive sample ranking processes, typically requiring full-dataset training or reference models. We address this gap by proposing Swift Cross-Dataset Pruning (SCDP). Specifically, our approach uses TF-IDF embeddings with the geometric median to rapidly evaluate sample importance. We then apply dataset size-adaptive pruning to ensure diversity: for smaller datasets, we retain samples far from the geometric median, while for larger ones, we employ distance-based stratified pruning. Experimental results on six diverse datasets, spanning various tasks and scales, demonstrate the effectiveness of our method while significantly reducing computational resources. Source code is available at: https://github.com/he-y/NLP-Dataset-Pruning
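
As a rough illustration of the pipeline the abstract describes, the following Python sketch scores each sample by the distance of its TF-IDF vector from an approximate geometric median, then prunes differently depending on dataset size. It is not the authors' implementation (the linked repository holds that). The whitespace tokeniser, the smoothed IDF weighting, the Weiszfeld iteration with stopping tolerance `eps`, the `small_threshold` cutoff, and the equal-count stratification are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Dense TF-IDF vectors from whitespace-tokenised text (illustrative weighting)."""
    tokenised = [d.lower().split() for d in docs]
    vocab = sorted({t for toks in tokenised for t in toks})
    index = {t: i for i, t in enumerate(vocab)}
    n = len(docs)
    df = Counter(t for toks in tokenised for t in set(toks))  # document frequency
    vectors = []
    for toks in tokenised:
        vec = [0.0] * len(vocab)
        for t, count in Counter(toks).items():
            idf = math.log((1 + n) / (1 + df[t])) + 1.0  # smoothed IDF (assumption)
            vec[index[t]] = (count / len(toks)) * idf
        vectors.append(vec)
    return vectors


def geometric_median(points, eps=1e-6, max_iter=200):
    """Weiszfeld iterations; stops once the estimate moves by less than eps."""
    dim = len(points[0])
    g = [sum(p[j] for p in points) / len(points) for j in range(dim)]  # start at the mean
    for _ in range(max_iter):
        # Inverse-distance weights, guarded against division by zero.
        weights = [1.0 / max(math.dist(p, g), 1e-12) for p in points]
        total = sum(weights)
        new_g = [sum(w * p[j] for w, p in zip(weights, points)) / total
                 for j in range(dim)]
        if math.dist(new_g, g) < eps:
            return new_g
        g = new_g
    return g


def fd_scores(docs):
    """Distance of each sample's TF-IDF vector from the approximate geometric median."""
    vecs = tfidf_vectors(docs)
    g = geometric_median(vecs)
    return [math.dist(v, g) for v in vecs]


def prune(docs, rate, small_threshold=1000):
    """Return the sorted indices of the (1 - rate) * len(docs) samples that are kept.

    small_threshold is a hypothetical cutoff between "small" and "large" datasets.
    """
    scores = fd_scores(docs)
    n = len(docs)
    keep = round((1 - rate) * n)
    order = sorted(range(n), key=lambda i: scores[i])  # nearest to farthest
    if n < small_threshold:
        # Small dataset: keep the samples farthest from the geometric median.
        return sorted(order[n - keep:])
    # Large dataset: split the distance ranking into `keep` equal-count strata
    # and take the middle sample of each, so every distance range is represented.
    return sorted(order[int((j + 0.5) * n / keep)] for j in range(keep))


if __name__ == "__main__":
    corpus = ["the cat sat", "the cat sat down", "the cat ran",
              "a dog barked loudly", "stocks fell sharply today", "the cat sat still"]
    print(prune(corpus, rate=0.5))                     # small-dataset branch
    print(prune(corpus, rate=0.5, small_threshold=0))  # forces the stratified branch
```

The only expensive step here is the geometric median, which needs no model training; that is the property the abstract contrasts with ranking methods that depend on full-dataset training or reference models.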

Paper Structure (21 sections, 7 equations, 5 figures, 14 tables, 1 algorithm)

Figures (5)

  • Figure 1: Accuracy and time required for ranking samples on the SWAG and QNLI datasets for our proposed method, EL2N, CCS, and Forgetting at 50% and 70% pruning rates. Our method is significantly more time-efficient and yields higher accuracy.
  • Figure 2: Overview of the proposed method. We introduce the Frequency Distance (FD) score, which combines TF-IDF embeddings with geometric median calculations to swiftly assess sample importance. We propose dataset size-adaptive pruning to enhance adaptability in the cross-dataset setting.
  • Figure 3: Results of pruning strategies from a set of distance-based scores.
  • Figure 4: PCA plot of the data points selected by different pruning strategies, shown against the full training set, for the SWAG dataset at a 70% pruning rate.
  • Figure 5: PCA plot of the data points selected by different pruning strategies, shown against the full training set, for the QNLI dataset at a 70% pruning rate.