LAUD: Integrating Large Language Models with Active Learning for Unlabeled Data

Tzu-Hsuan Chou, Chun-Nan Chou

TL;DR

LAUD addresses the scarcity of labeled data when adapting LLMs to task-specific needs. It integrates LLMs with an active-learning (AL) loop that is seeded with zero-shot predictions and iteratively acquires informative annotations, producing task-specific LLMs (TLLMs) from unlabeled data. The framework combines initialization, an AL loop with fine-tuning or in-context learning, evaluation, and oracle roles to optimize annotation efficiency. Empirical results on commodity name classification show that TLLMs consistently outperform zero-shot baselines and that active learning yields meaningful precision gains, while LLMs can serve as cost-effective oracles comparable to humans. In a real-world ad-targeting system, TLLMs derived via LAUD deliver substantial click-through rate (CTR) improvements, highlighting practical impact and scalability.

Abstract

Large language models (LLMs) have shown a remarkable ability to generalize beyond their pre-training data, and fine-tuning LLMs can elevate performance to human level and beyond. However, in real-world scenarios, a lack of labeled data often prevents practitioners from obtaining well-performing models, forcing them to rely heavily on prompt-based approaches that are often tedious, inefficient, and driven by trial and error. To alleviate this lack of labeled data, we present LAUD, a learning framework that integrates LLMs with active learning for unlabeled data. LAUD mitigates the cold-start problem by constructing an initial label set with zero-shot learning. Experimental results show that LLMs derived from LAUD outperform LLMs with zero-shot or few-shot learning on commodity name classification tasks, demonstrating the effectiveness of LAUD.
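
The loop summarized above (zero-shot seeding, then repeated query, annotate, and retrain steps) can be illustrated with a toy sketch. This is not the authors' implementation, and every name in it is hypothetical: `zero_shot_label` stands in for a zero-shot LLM call, `TokenCountModel` stands in for a fine-tuned TLLM, and `oracle_label` stands in for a human or LLM oracle. The acquisition rule used here is least-confidence uncertainty sampling, which is an assumption; the text above does not say which acquisition function LAUD uses.

```python
from collections import Counter, defaultdict


def zero_shot_label(text):
    """Hypothetical stand-in for a zero-shot LLM call: a crude keyword rule."""
    return "food" if ("apple" in text or "rice" in text) else "electronics"


def oracle_label(text, truth):
    """Hypothetical oracle (human or LLM): here it just looks up the true label."""
    return truth[text]


class TokenCountModel:
    """Toy stand-in for a fine-tuned TLLM: per-label token counts."""

    def fit(self, labeled):
        self.counts = defaultdict(Counter)
        for text, label in labeled.items():
            self.counts[label].update(text.split())
        return self

    def predict_proba(self, text):
        # Add-one smoothed token-overlap score per label, normalised to sum to 1.
        scores = {
            label: 1 + sum(c[tok] for tok in text.split())
            for label, c in self.counts.items()
        }
        total = sum(scores.values())
        return {label: s / total for label, s in scores.items()}

    def predict(self, text):
        proba = self.predict_proba(text)
        return max(proba, key=proba.get)


def laud_sketch(pool, truth, seed_size=2, rounds=2, batch=1):
    # Initialization: seed the label set with zero-shot predictions (assumption).
    labeled = {t: zero_shot_label(t) for t in pool[:seed_size]}
    unlabeled = [t for t in pool if t not in labeled]
    model = TokenCountModel().fit(labeled)

    for _ in range(rounds):
        if not unlabeled:
            break
        # Least-confidence sampling: lowest top-class probability first.
        unlabeled.sort(key=lambda t: max(model.predict_proba(t).values()))
        queried, unlabeled = unlabeled[:batch], unlabeled[batch:]
        for t in queried:
            labeled[t] = oracle_label(t, truth)  # ask the oracle for an annotation
        model = TokenCountModel().fit(labeled)   # retrain on the enlarged label set
    return model, labeled


# Made-up commodity names, for illustration only.
truth = {
    "red apple": "food",
    "usb cable": "electronics",
    "green apple juice": "food",
    "hdmi cable": "electronics",
    "brown rice": "food",
    "phone charger": "electronics",
}
pool = list(truth)
model, labeled = laud_sketch(pool, truth)
print(sorted(labeled))
```

In this toy run the loop spends its two oracle queries on the items whose tokens the seed model has never seen, which is the behavior uncertainty sampling is meant to produce. A real setup would replace the stand-ins with actual LLM calls and add a held-out evaluation step.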

Paper Structure

This paper contains 21 sections, 2 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Illustration of LAUD. LAUD integrates LLMs with active learning to derive TLLMs from unlabeled data. One or more oracles in LAUD are queried to provide annotations for training and evaluating TLLMs.