LAUD: Integrating Large Language Models with Active Learning for Unlabeled Data
Tzu-Hsuan Chou, Chun-Nan Chou
TL;DR
LAUD addresses the scarcity of labeled data for adapting LLMs to task-specific needs. It integrates LLMs with an active-learning (AL) loop that is seeded with zero-shot predictions and iteratively acquires informative annotations, producing task-specific LLMs (TLLMs) from unlabeled data. The framework combines initialization, an AL loop with fine-tuning or in-context learning, evaluation, and oracle roles to optimize annotation efficiency. Empirical results on commodity name classification show that TLLMs consistently outperform zero-shot baselines and that active learning yields meaningful precision gains, while LLMs can serve as cost-effective oracles comparable to humans. In a real-world ad-targeting system, TLLMs derived via LAUD deliver substantial CTR improvements, highlighting practical impact and scalability.
Abstract
Large language models (LLMs) have shown a remarkable ability to generalize beyond their pre-training data, and fine-tuning LLMs can elevate performance to human level and beyond. However, in real-world scenarios, a lack of labeled data often prevents practitioners from obtaining well-performing models, forcing them to rely heavily on prompt-based approaches that are often tedious, inefficient, and driven by trial and error. To alleviate this lack of labeled data, we present LAUD, a learning framework that integrates LLMs with active learning for unlabeled data. LAUD mitigates the cold-start problem by constructing an initial label set with zero-shot learning. Experimental results show that LLMs derived from LAUD outperform LLMs with zero-shot or few-shot learning on commodity name classification tasks, demonstrating the effectiveness of LAUD.
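To make the workflow concrete, here is a minimal sketch of a zero-shot-seeded active-learning loop. It is an illustration under stated assumptions, not the authors' implementation: every name in it (`zero_shot`, `oracle`, `fit`, `predict_proba`, `laud_loop`, the toy data in `TRUTH`) is hypothetical, a token-count classifier stands in for LLM fine-tuning or in-context learning, and entropy-based uncertainty sampling is an assumed acquisition strategy that the text above does not specify.

```python
import math
from collections import Counter

# Hypothetical toy data: commodity names with their true labels.
TRUTH = {
    "green apple": "food",
    "usb cable": "electronics",
    "apple charger": "electronics",
    "rye bread": "food",
    "hdmi cable": "electronics",
}
LABELS = ["food", "electronics"]


def zero_shot(name):
    """Stand-in for an LLM zero-shot prediction (assumption: keyword rule)."""
    return "food" if any(w in name for w in ("apple", "bread")) else "electronics"


def oracle(name):
    """Stand-in for the annotator (human or LLM); here it returns the true label."""
    return TRUTH[name]


def fit(labeled):
    """Toy replacement for fine-tuning: count tokens seen under each label."""
    counts = {label: Counter() for label in LABELS}
    for name, label in labeled.items():
        counts[label].update(name.split())
    return counts


def predict_proba(model, name):
    """Add-one smoothed token score per label, normalized to sum to 1."""
    scores = []
    for label in LABELS:
        score = 1.0
        for token in name.split():
            score *= model[label][token] + 1
        scores.append(score)
    total = sum(scores)
    return [s / total for s in scores]


def entropy(probs):
    """Shannon entropy of a probability vector; higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def laud_loop(pool, seed_size=2, batch_size=1, rounds=2):
    # Initialization: seed the labeled set with zero-shot predictions.
    labeled = {x: zero_shot(x) for x in pool[:seed_size]}
    unlabeled = list(pool[seed_size:])
    model = fit(labeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        # Acquisition: pick the items the current model is least sure about.
        unlabeled.sort(key=lambda x: entropy(predict_proba(model, x)), reverse=True)
        batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        # Annotation: ask the oracle, then refit on the enlarged labeled set.
        for x in batch:
            labeled[x] = oracle(x)
        model = fit(labeled)
    return model, labeled


model, labeled = laud_loop(list(TRUTH))
print(labeled)
```

In this sketch the seed labels come from the zero-shot predictor and may be noisy, while later labels come from the oracle. Swapping `fit` for a real fine-tuning or prompt-construction step, and `oracle` for a human or LLM annotator, is left to the reader.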
