How Many Validation Labels Do You Need? Exploring the Design Space of Label-Efficient Model Ranking

Zhengyu Hu; Jieyu Zhang; Yue Yu; Yuchen Zhuang; Hui Xiong

TL;DR

LEMR addresses the challenge of selecting among many NLP models under restricted annotation budgets. It introduces a four-step approach that generates pseudo-labels from a model committee, selectively acquires ground-truth labels via uncertainty-aware sampling, updates the committee with a $Z$-score or All-model rule, and finally ranks models using refined labels, $r_p = r(L_p,L_g,\mathcal{M})$. MoraBench provides a diverse, task-spanning benchmark to evaluate label-efficient ranking across semi-supervised, weak supervision, and prompt-selection scenarios. Across 23 tasks, LEMR achieves substantial reductions in labeling costs while maintaining ranking accuracy, with uncertainty sampling and high-quality committees driving robust performance. Together, the framework and benchmark offer a practical pathway for resource-constrained model selection in NLP.
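
The four steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the soft-voting ensemble, the entropy-based uncertainty score, the `batch` size, and the `z_threshold` cutoff are all hypothetical choices made for the example. Here `labels` mixes pseudo-labels $L_p$ with acquired ground-truth labels $L_g$, and models are ranked by agreement with that refined label vector.

```python
import numpy as np


def pseudo_labels(probs, committee):
    """Step I (assumed soft voting): average committee probabilities."""
    avg = probs[committee].mean(axis=0)                 # (n_items, n_classes)
    entropy = -(avg * np.log(avg + 1e-12)).sum(axis=1)  # ensemble uncertainty
    return avg.argmax(axis=1), entropy


def accuracy_vs(labels, probs):
    """Fraction of items where each model's hard prediction matches labels."""
    return (probs.argmax(axis=2) == labels).mean(axis=1)


def lemr_rank(probs, oracle, budget, batch=2, z_threshold=0.0):
    """probs: (n_models, n_items, n_classes); oracle is read only when queried."""
    n_models, n_items, _ = probs.shape
    budget = min(budget, n_items)
    committee = np.arange(n_models)   # start from the All-model committee
    acquired = {}                     # L_g: item index -> ground-truth label
    while True:
        labels, entropy = pseudo_labels(probs, committee)   # L_p
        for i, y in acquired.items():
            labels[i] = y             # ground truth overrides pseudo-labels
        if len(acquired) >= budget:
            break
        # Step II: query the most uncertain items not yet labeled.
        entropy[np.array(list(acquired), dtype=int)] = -np.inf
        take = min(batch, budget - len(acquired))
        for i in np.argsort(-entropy)[:take]:
            acquired[int(i)] = int(oracle[i])
            labels[i] = oracle[i]
        # Step III (assumed Z-score rule): keep models at or above the cutoff.
        acc = accuracy_vs(labels, probs)
        z = (acc - acc.mean()) / (acc.std() + 1e-12)
        keep = np.flatnonzero(z >= z_threshold)
        committee = keep if keep.size else np.arange(n_models)
    # Step IV: rank models against the refined labels, best first.
    acc = accuracy_vs(labels, probs)
    return np.argsort(-acc, kind="stable"), acc


# Toy data: three models with 6/6, 5/6 and 2/6 correct predictions.
oracle = np.array([0, 1, 0, 1, 0, 1])
preds = [[0, 1, 0, 1, 0, 1], [0, 1, 0, 1, 0, 0], [1, 0, 1, 0, 0, 1]]
probs = np.stack([np.eye(2)[p] * 0.8 + 0.1 for p in preds])
order, acc = lemr_rank(probs, oracle, budget=6)
print(order, acc)
```

With the full budget every item receives a ground-truth label, so the estimated ranking coincides with the ranking by true accuracy; smaller budgets trade label cost against how closely the refined labels track the truth.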

Abstract

This paper presents LEMR (Label-Efficient Model Ranking) and introduces the MoraBench Benchmark. LEMR is a novel framework that minimizes the need for costly annotations in model selection by strategically annotating instances from an unlabeled validation set. To evaluate LEMR, we leverage the MoraBench Benchmark, a comprehensive collection of model outputs across diverse scenarios. Our extensive evaluation across 23 different NLP tasks spanning semi-supervised learning, weak supervision, and prompt selection demonstrates LEMR's effectiveness in significantly reducing labeling costs. Key findings highlight the impact of suitable ensemble methods, uncertainty sampling strategies, and model committee selection on model ranking accuracy. LEMR, supported by the insights from MoraBench, provides a cost-effective and accurate solution for model selection, which is especially valuable in resource-constrained environments.

Paper Structure (41 sections, 5 equations, 7 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Pseudo-labeling
Model Selection
Preliminaries
Methodology
Step-I: Pseudo-label Generation
Step-II: Active Label Acquisition
Step-III: Model Committee Selection
Step-IV: Model Ranking
The MoraBench Benchmark
Evaluation Metrics
Optimal Gap.
Ranking Correction.
Experiments
...and 26 more sections

Figures (7)

Figure 1: The illustration of the overall procedure of LEMR.
Figure 2: Semi-supervised learning setting: This figure illustrates the changes in ranking correction values within our design space. These changes are observed across budget ratios from 0 to 1. The number after the dataset name indicates the number of labels used in the model training stage.
Figure 3: Weak supervision setting: This figure illustrates the changes in ranking correction values within our design space. These changes are observed across budget ratios from 0 to 1.
Figure 4: Prompt selection setting: This figure illustrates the changes in ranking correction values within our design space. These changes are observed across budget ratios from 0 to 1. The number after the dataset indicates the number of labels under the semi-supervised learning setting.
Figure 5: Weak supervision setting: This figure illustrates the changes in optimal gap values within our design space. These changes are observed across budget ratios from 0 to 1.
...and 2 more figures
