Language Model-Driven Data Pruning Enables Efficient Active Learning

Abdul Hameed Azeemi; Ihsan Ayyub Qazi; Agha Ali Raza

TL;DR

ActivePrune reduces the high computational cost of acquisition-driven active learning (AL) on large unlabeled pools with a two-stage pruning approach: a fast perplexity-based filter using KenLM, followed by a higher-quality scoring step with a 4-bit quantized LLM (Gemma-2B). A perplexity reweighting mechanism further promotes diversity across AL iterations by favoring underrepresented samples. Across translation, sentiment analysis, topic classification, and summarization tasks, ActivePrune outperforms baseline pruning methods in selection quality while reducing end-to-end AL time by up to 74%. The method is architecture-agnostic and improves interactivity in labeling workflows, enabling more efficient active learning at scale. Noted limitations concern privacy and fairness.

Abstract

Active learning (AL) optimizes data labeling efficiency by selecting the most informative instances for annotation. A key component in this procedure is an acquisition function that guides the selection process and identifies suitable instances for labeling from the unlabeled pool. However, these acquisition methods incur high computational costs on large unlabeled data pools, which limits their applicability to large datasets. To address this challenge, we introduce a novel plug-and-play unlabeled data pruning strategy, ActivePrune, which leverages language models to prune the unlabeled pool. ActivePrune implements a two-stage pruning process: an initial fast evaluation using perplexity scores from an n-gram language model, followed by a high-quality selection using data quality metrics computed through a quantized LLM. Additionally, to enhance diversity in the unlabeled pool, we propose a novel perplexity reweighting method that systematically brings forward underrepresented instances for selection in subsequent labeling iterations. Experiments on translation, sentiment analysis, topic classification, and summarization tasks, covering four diverse datasets and four active learning strategies, demonstrate that ActivePrune outperforms existing data pruning methods. Finally, we compare the trade-off between selection quality and efficiency across the data pruning methods and demonstrate that ActivePrune is computationally more efficient than other LLM score-based pruning methods, providing up to a 74% reduction in the end-to-end time required for active learning.
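
The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function `two_stage_prune`, its parameters (`n_direct`, `n_candidates`, `n_quality`), and the rule for combining the two scores (keep the lowest-perplexity instances directly, then keep the best-scored instances from a bounded candidate subset) are assumptions made for illustration. The toy scorers stand in for an n-gram LM perplexity and an LLM quality score.

```python
from typing import Callable, List, Sequence


def two_stage_prune(
    pool: Sequence[str],
    perplexity_fn: Callable[[str], float],
    quality_fn: Callable[[str], float],
    n_direct: int,
    n_candidates: int,
    n_quality: int,
) -> List[str]:
    """Illustrative two-stage pruning; the combination rule is an assumption."""
    # Stage 1: a cheap score is computed for every instance in the pool
    # (stand-in for perplexity from an n-gram language model).
    ranked = sorted(pool, key=perplexity_fn)
    direct = ranked[:n_direct]

    # Stage 2: the expensive score is computed only on a bounded subset
    # (stand-in for a quality score from a quantized LLM).
    candidates = ranked[n_direct:n_direct + n_candidates]
    by_quality = sorted(candidates, key=quality_fn, reverse=True)

    # The pruned pool is what an acquisition function would then search.
    return direct + by_quality[:n_quality]


def toy_perplexity(text: str) -> float:
    """Hypothetical cheap scorer: word count as a proxy for perplexity."""
    return float(len(text.split()))


def toy_quality(text: str) -> float:
    """Hypothetical expensive scorer: fraction of distinct words."""
    words = text.split()
    return len(set(words)) / len(words) if words else 0.0


if __name__ == "__main__":
    pool = ["a b", "a b c d e f", "a a a a", "x y z", "q"]
    print(two_stage_prune(pool, toy_perplexity, toy_quality, 2, 2, 1))
```

The point of the structure is cost control: the expensive scorer runs on at most `n_candidates` instances regardless of pool size, while only the cheap scorer touches the whole pool.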

Paper Structure (29 sections, 1 theorem, 19 equations, 4 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Active Learning
Data Pruning
LLMs for Data Filtering
Method
Preliminaries
Perplexity Calculation
LLM Data Quality Scores
Perplexity Reweighting
Experimental Setup
Datasets
Models
Active Learning Methods
Baselines
...and 14 more sections

Key Result

Proposition 1

Let $\mathcal{U} = \{ x_i \}_{i=1}^{N}$ be the unlabeled pool and $\mathcal{L}_t$ be the labeled set at iteration $t$. Let $S \subseteq \mathcal{U}$ be an underrepresented subset, that is, the instances whose perplexity scores differ significantly from those in $\mathcal{L}_t$

Figures (4)

Figure 1: Illustration of the proposed ActivePrune framework for data pruning in Active Learning. Perplexity scores are first computed for the entire unlabeled pool through the KenLM 5-gram model, followed by the computation of data quality scores on a subset of examples through a quantized LLM. Then, the data pruning strategy leverages both these scores to prune the unlabeled pool and send it as the input to the AL acquisition function. After each iteration, a reweighting algorithm adjusts the perplexity distribution based on the selected examples to enhance the diversity for the next iteration.
Figure 2: An example prompt for determining the data quality score for a training example on the translation task.
Figure 3: Trade-off between quality and efficiency across various data pruning strategies on the IT domain dataset with the NSP acquisition method. The perplexity method in the figure reflects the perplexity scores computed through the LLM (Gemma-2B) and not the perplexity scores from KenLM (which is a component of ActivePrune only).
Figure 4: Distribution of perplexity vs. LLM data quality scores across samples selected through different pruning methods for the IMDB dataset. Each subplot represents sentences selected through the Active Learning (AL) procedure across 10 iterations, with each iteration selecting 1% of the dataset. Orange points indicate examples with a high LLM quality score or a high perplexity score. The color gradient indicates the sequence of iterations, where the darkest shade denotes the first iteration and progressively lighter shades denote subsequent iterations, up to the lightest shade for the tenth iteration. Subplots are organized by sampling strategy.

Theorems & Definitions (2)

Proposition 1
proof
