LLM on a Budget: Active Knowledge Distillation for Efficient Classification of Large Text Corpora

Viviana Luccioli, Rithika Iyengar, Ryan Panley, Flora Haberkorn, Xiaoyu Ge, Leland Crane, Nitish Sinha, Seung Jung Lee

TL;DR

The paper tackles the cost barrier of deploying large language models for text classification by combining knowledge distillation with active learning. It introduces M-RARU, a multi-class randomized accept/reject uncertainty sampling strategy that selects only the most informative unlabeled examples for LLM labeling, so that lightweight student models can be trained with far fewer API calls. Across two real-world datasets and five student architectures, M-RARU consistently outperforms random sampling, needing up to 80% fewer labeled samples while also improving accuracy and balanced accuracy. The approach combines embedding-based representations, uncertainty-driven querying, and interpretable downstream models to enable fast, cost-effective deployment of LLM-informed classifiers in resource-constrained settings.

Abstract

Large Language Models (LLMs) are highly accurate in classification tasks; however, substantial computational and financial costs hinder their large-scale deployment in dynamic environments. Knowledge Distillation (KD), where an LLM "teacher" trains a smaller and more efficient "student" model, offers a promising solution to this problem. However, the distillation process itself often remains costly for large datasets, since it requires the teacher to label a vast number of samples while incurring significant token consumption. To alleviate this challenge, in this work we explore active learning (AL) as a way to create efficient student models at a fraction of the cost while preserving the LLM's performance. In particular, we introduce M-RARU (Multi-class Randomized Accept/Reject Uncertainty Sampling), a novel AL algorithm that significantly reduces training costs. M-RARU combines uncertainty with a randomized accept-reject mechanism to select only the most informative data points for the LLM teacher. This focused approach significantly reduces the required API calls and data processing time. We evaluate M-RARU against random sampling across five diverse student models (SVM, LDA, RF, GBDT, and DistilBERT) on multiple benchmark datasets. Experiments demonstrate that our proposed method achieves up to an 80% reduction in sample requirements compared to random sampling, substantially improving classification accuracy while reducing financial costs and overall training time.
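
As a rough illustration of how an uncertainty score can be paired with a randomized accept/reject step, the Python sketch below visits an unlabeled pool in random order and accepts each item for teacher labeling with probability equal to the student's uncertainty about it. It is not the paper's algorithm: the abstract gives no formula, so the least-confidence score, the acceptance rule, the stopping rule, and the names least_confidence and accept_reject_select are all assumptions made for illustration.

```python
import numpy as np


def least_confidence(probs):
    """Assumed uncertainty score in [0, 1].

    0 when the student puts all mass on one class, 1 when its
    predicted distribution over the k classes is uniform.
    """
    probs = np.asarray(probs, dtype=float)
    k = probs.shape[1]
    score = (1.0 - probs.max(axis=1)) * k / (k - 1)
    return np.clip(score, 0.0, 1.0)


def accept_reject_select(probs, budget, rng):
    """Pick up to `budget` pool indices to send to the LLM teacher.

    Hypothetical rule: walk the pool in random order and accept item i
    when a uniform draw falls below its uncertainty, so confident items
    are usually skipped and uncertain ones are usually kept.
    """
    uncertainty = least_confidence(probs)
    selected = []
    for i in rng.permutation(len(uncertainty)):
        if rng.random() < uncertainty[i]:
            selected.append(int(i))
            if len(selected) == budget:
                break
    return selected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy student predictions over 3 classes for a pool of 1000 texts.
    pool_probs = rng.dirichlet(alpha=[0.3, 0.3, 0.3], size=1000)
    chosen = accept_reject_select(pool_probs, budget=20, rng=rng)
    print(len(chosen), "items selected for teacher labeling")
```

In a full active-learning loop, the selected items would be labeled by the LLM teacher, added to the training set, and the student retrained before the next round of selection. The randomization means that moderately uncertain items still have some chance of being queried, rather than only the top-ranked ones.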

Paper Structure

This paper contains 16 sections, 4 equations, 21 figures, 3 tables, 1 algorithm.

Figures (21)

  • Figure 1: Text classification for GDP trends.
  • Figure 2: SVM Public Comments Accuracy
  • Figure 3: SVM Public Comments Balanced Accuracy
  • Figure 4: LDA Public Comments Accuracy
  • Figure 5: LDA Public Comments Balanced Accuracy
  • ...and 16 more figures