How Many Validation Labels Do You Need? Exploring the Design Space of Label-Efficient Model Ranking
Zhengyu Hu, Jieyu Zhang, Yue Yu, Yuchen Zhuang, Hui Xiong
TL;DR
LEMR addresses the problem of selecting among many NLP models under a restricted annotation budget. It proceeds in four steps: it generates pseudo-labels from a model committee, acquires ground-truth labels selectively via uncertainty-aware sampling, updates the committee with a $Z$-score or All-model rule, and finally ranks the models using the refined labels, $r_p = r(L_p, L_g, \mathcal{M})$, where $L_p$ and $L_g$ denote the pseudo-labels and the acquired ground-truth labels and $\mathcal{M}$ is the set of models being ranked. MoraBench provides a diverse benchmark for evaluating label-efficient ranking across semi-supervised learning, weak supervision, and prompt selection scenarios. Across 23 tasks, LEMR substantially reduces labeling costs while maintaining ranking accuracy, with uncertainty sampling and high-quality committees driving robust performance. Together, the framework and benchmark offer a practical pathway for resource-constrained model selection in NLP.
Abstract
This paper presents LEMR (Label-Efficient Model Ranking) and introduces the MoraBench benchmark. LEMR is a novel framework that minimizes the need for costly annotations in model selection by strategically annotating instances from an unlabeled validation set. To evaluate LEMR, we leverage MoraBench, a comprehensive collection of model outputs across diverse scenarios. Our extensive evaluation across 23 NLP tasks spanning semi-supervised learning, weak supervision, and prompt selection demonstrates LEMR's effectiveness in significantly reducing labeling costs. Key findings highlight how the choice of ensemble method, uncertainty sampling strategy, and model committee selection affects model ranking accuracy. LEMR, supported by the insights from MoraBench, provides a cost-effective and accurate solution for model selection, especially valuable in resource-constrained environments.
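To make the four-step loop summarized above concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each model's validation outputs are available as class-probability lists, that pseudo-labels come from a soft (probability-averaging) ensemble, that uncertainty is measured by the entropy of the ensemble distribution, and that the ranking function $r$ is plain accuracy against the combined labels. The function name `lemr_sketch` and the parameters `batch` and `z_threshold` are hypothetical; passing `z_threshold=None` skips pruning, which is how this sketch reads the All-model rule.

```python
import math
from statistics import mean, pstdev


def lemr_sketch(probs, oracle, budget, batch=10, z_threshold=-1.0):
    """Rank models with a limited number of ground-truth labels (illustrative).

    probs:  probs[m][i][c] = probability model m assigns class c to sample i
    oracle: oracle[i] = true label of sample i, read only for acquired samples
    budget: total number of ground-truth labels that may be acquired
    """
    n_models, n_samples, n_classes = len(probs), len(probs[0]), len(probs[0][0])
    budget = min(budget, n_samples)
    committee = list(range(n_models))  # start with every model in the committee
    gold = {}                          # sample index -> acquired true label (L_g)
    # Hard prediction of every model on every sample.
    preds = [[max(range(n_classes), key=p.__getitem__) for p in model]
             for model in probs]

    def ensemble():
        # Step 1: average the committee's probabilities per sample and class.
        return [[mean(probs[m][i][c] for m in committee) for c in range(n_classes)]
                for i in range(n_samples)]

    def labels_from(ens):
        # Acquired true labels override the ensemble's pseudo-labels (L_p).
        return [gold.get(i, max(range(n_classes), key=ens[i].__getitem__))
                for i in range(n_samples)]

    def accuracy(m, labels):
        return sum(p == y for p, y in zip(preds[m], labels)) / n_samples

    while len(gold) < budget:
        ens = ensemble()
        # Step 2: query the unlabeled samples with the highest ensemble entropy.
        entropy = {i: -sum(p * math.log(p) for p in ens[i] if p > 0)
                   for i in range(n_samples) if i not in gold}
        k = min(batch, budget - len(gold))
        for i in sorted(entropy, key=entropy.get, reverse=True)[:k]:
            gold[i] = oracle[i]
        # Step 3: drop committee members whose accuracy z-score is too low.
        labels = labels_from(ens)
        accs = [accuracy(m, labels) for m in committee]
        mu, sigma = mean(accs), pstdev(accs)
        if z_threshold is not None and sigma > 0:
            kept = [m for m, a in zip(committee, accs)
                    if (a - mu) / sigma >= z_threshold]
            committee = kept or committee  # never let the committee become empty

    # Step 4: rank all models against the refined labels.
    labels = labels_from(ensemble())
    scores = [accuracy(m, labels) for m in range(n_models)]
    ranking = sorted(range(n_models), key=lambda m: -scores[m])
    return ranking, scores
```

Calling `lemr_sketch(probs, oracle, budget=50)` returns the model indices ordered from best to worst together with their estimated accuracies. When the budget covers the whole validation set, every label is a ground-truth label and the result reduces to ordinary fully labeled ranking, which gives a simple sanity check for the sketch.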
