Label-Efficient Model Selection for Text Generation

Shir Ashury-Tahan, Ariel Gera, Benjamin Sznajder, Leshem Choshen, Liat Ein-Dor, Eyal Shnarch

TL;DR

Evaluating text-generation models is expensive because it requires reliable preference judgments. DiffUse is a label-efficient framework that builds semantic difference embeddings from model outputs, clusters these differences, and selects a small, informative set of examples for oracle annotation. Across six generation tasks from HELM and hundreds of model pairs, DiffUse reduces annotation needs by up to $75\%$ while preserving reliable model ranking. It also applies to prompt and configuration selection, and an iterative, risk-based stopping criterion determines how many instances to annotate. The approach is model-agnostic and focuses on the structure of output differences; analyses indicate that high-norm difference vectors drive the informative selections.

Abstract

Model selection for a given target task can be costly, as it may entail extensive annotation of the quality of outputs of different models. We introduce DiffUse, an efficient method to make an informed decision between candidate text generation models based on preference annotations. DiffUse reduces the required amount of annotations, thus saving valuable time and resources in performing evaluation. DiffUse intelligently selects instances by clustering embeddings that represent the semantic differences between model outputs. Thus, it is able to identify a subset of examples that are more informative for preference decisions. Our method is model-agnostic, and can be applied to any text generation model for selecting between models, prompts and configurations. Moreover, we propose a practical iterative approach for dynamically determining how many instances to annotate. In a series of experiments over hundreds of model pairs, we demonstrate that DiffUse can dramatically reduce the required number of annotations -- by up to 75% -- while maintaining high evaluation reliability.
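The selection step described above (embed both models' outputs, subtract pairwise, cluster the difference vectors, pick one representative per cluster) can be illustrated with a minimal sketch. It is not the authors' implementation. The embeddings are assumed to be precomputed by some sentence encoder, k-means with a farthest-point initialization stands in for whichever clustering algorithm the paper uses, and taking the member nearest each centroid as the representative is likewise an assumption. All function names are hypothetical.

```python
def difference_vectors(emb_a, emb_b):
    """Pairwise subtraction: one difference vector per test instance."""
    return [[a - b for a, b in zip(va, vb)] for va, vb in zip(emb_a, emb_b)]


def _sqdist(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))


def _mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def _farthest_point_init(vectors, k):
    # Deterministic seeding (an assumption): start at the first vector,
    # then repeatedly add the vector farthest from all chosen centers.
    centers = [list(vectors[0])]
    while len(centers) < k:
        far = max(vectors, key=lambda v: min(_sqdist(v, c) for c in centers))
        centers.append(list(far))
    return centers


def kmeans(vectors, k, iters=50):
    """Plain k-means; a stand-in for the clustering step, requires k <= len(vectors)."""
    centers = _farthest_point_init(vectors, k)
    assign = None
    for _ in range(iters):
        new_assign = [
            min(range(k), key=lambda c: _sqdist(v, centers[c])) for v in vectors
        ]
        if new_assign == assign:  # converged
            break
        assign = new_assign
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = _mean(members)
    return assign, centers


def select_for_annotation(emb_a, emb_b, k):
    """Return indices of up to k instances to send to the oracle."""
    diffs = difference_vectors(emb_a, emb_b)
    assign, centers = kmeans(diffs, k)
    chosen = []
    for c in range(k):
        members = [i for i, a in enumerate(assign) if a == c]
        if members:
            # Representative = member closest to the cluster centroid.
            chosen.append(min(members, key=lambda i: _sqdist(diffs[i], centers[c])))
    return sorted(chosen)


# Toy data: six instances whose output differences fall into three groups.
emb_a = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [-5.0, 5.0], [-5.0, 5.2]]
emb_b = [[0.0, 0.0]] * 6
selected = select_for_annotation(emb_a, emb_b, k=3)
print(selected)
```

Only the instances in `selected` would be shown to the oracle; the preference labels on them are then used to decide between the two models. In practice the embeddings would come from a real encoder and the annotation budget `k` would be set by the iterative procedure mentioned in the abstract, neither of which is modeled here.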

Paper Structure (29 sections, 3 equations, 17 figures, 3 tables)

Figures (17)

  • Figure 1: Aggregated success rate over CNN/DailyMail, across all $666$ model pairs ($\times10$ repetitions for each pair). DiffUse demonstrates a clear advantage in correctly determining the stronger model, based on a small number of oracle-annotated examples.
  • Figure 2: DiffUse flow. Our method consists of $5$ steps: performing inference with the models on the test set, encoding the generated outputs, performing pairwise subtraction, clustering the resulting vectors, and selecting representatives for evaluation. A comprehensive description is provided in §\ref{sec:method}.
  • Figure 3: Distribution of test winning distances (§\ref{sec:problem_formulation}) in HELM between pairs of generative models.
  • Figure 4: Comparing example selection methods. Success rates (± standard error) in identifying the best of two competing generative models (listed in the plot title), in terms of their performance over CNN/DailyMail (using Rouge-2 as the oracle).
  • Figure 5: Difference between the estimated and test winning distance, aggregated across all model pairs over XSum. Shaded areas denote standard error (averaged across pairs). Clearly, DiffUse favors the test winning model, giving a biased estimate in its favor. The bias dissipates with additional annotations, converging to the true distance for the full set of examples.
  • ...and 12 more figures