DEUCE: Dual-diversity Enhancement and Uncertainty-awareness for Cold-start Active Learning
Jiaxin Guo, C. L. Philip Chen, Shuzhen Li, Tong Zhang
TL;DR
DEUCE tackles cold-start active learning (CSAL) by combining dual-diversity with uncertainty awareness to acquire informative, class-balanced seed sets. It uses a prompt-based embedding module to produce textual and class representations, then builds a Dual-Neighbor Graph (DNG) that fuses textual and label-space neighborhoods. Density-based clustering (HDBSCAN*) and uncertainty propagation, followed by Farthest Point Sampling, select hard representative instances that improve class balance and textual coverage. Experiments on six NLP datasets show that DEUCE consistently outperforms strong CSAL baselines and is more efficient than LLM-based acquisition methods, which makes it practical for low-resource labeling scenarios. The work advances CSAL by integrating dual-diversity with uncertainty propagation in a graph-based acquisition pipeline for text classification.
Abstract
Cold-start active learning (CSAL) selects valuable instances from an unlabeled dataset for manual annotation, providing high-quality data at a low annotation cost for label-scarce text classification. However, existing CSAL methods overlook weak classes and hard representative examples, resulting in biased learning. To address these issues, this paper proposes a novel dual-diversity enhancing and uncertainty-aware (DEUCE) framework for CSAL. Specifically, DEUCE leverages a pretrained language model (PLM) to efficiently extract textual representations, class predictions, and predictive uncertainty. It then constructs a Dual-Neighbor Graph (DNG) that combines information on both textual diversity and class diversity, ensuring a balanced data distribution. It further propagates uncertainty information via density-based clustering to select hard representative instances. By combining dual-diversity with informativeness, DEUCE selects class-balanced and hard representative data. Experiments on six NLP datasets demonstrate the superiority and efficiency of DEUCE.
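
The pipeline described above can be illustrated with a minimal toy sketch. This is not the authors' implementation: the edge rule used here (keep an edge only when a point is a k-nearest neighbor in both the text space and the label space), the use of entropy as the uncertainty score, and all function names are assumptions made for illustration. The HDBSCAN* clustering and uncertainty propagation steps are omitted; the sketch only shows how a dual-neighbor edge set and Farthest Point Sampling might look on hand-made data.

```python
import math


def k_nearest(points, k):
    """For each point, return the set of indices of its k nearest other points."""
    neighbors = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbors.append(set(others[:k]))
    return neighbors


def dual_neighbor_edges(text_emb, class_prob, k):
    """Undirected edges (i, j) with i < j.

    Assumed fusion rule: j must be a k-nearest neighbor of i in BOTH the
    text-embedding space and the class-probability space.
    """
    text_nn = k_nearest(text_emb, k)
    label_nn = k_nearest(class_prob, k)
    edges = set()
    for i in range(len(text_emb)):
        for j in text_nn[i] & label_nn[i]:
            edges.add((min(i, j), max(i, j)))
    return edges


def entropy(probs):
    """Shannon entropy, used here as an assumed stand-in for predictive uncertainty."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def farthest_point_sampling(points, budget, start):
    """Greedily pick `budget` indices, each maximizing distance to those already picked."""
    chosen = [start]
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < budget:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        min_dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(min_dist, points)]
    return chosen


# Hand-made toy data: three tight pairs of points in a 2-D "text" space,
# with two-class probabilities standing in for PLM predictions.
text_emb = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.0, 0.2), (0.0, 5.0), (0.2, 5.1)]
class_prob = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8), (0.5, 0.5), (0.6, 0.4)]

edges = dual_neighbor_edges(text_emb, class_prob, k=1)

# Start sampling from the most uncertain point, then spread out over the text space.
most_uncertain = max(range(len(class_prob)), key=lambda i: entropy(class_prob[i]))
seed_set = farthest_point_sampling(text_emb, budget=3, start=most_uncertain)
```

On this toy data the edge set links each tight pair, and the sampled seed set contains one point from each pair, starting with the point whose prediction is least confident. Real embeddings would call for a vectorized or approximate nearest-neighbor search rather than the quadratic loops used here.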
