Active Model Selection for Large Language Models

Yavuz Durmazkeser, Patrik Okanovic, Andreas Kirsch, Torsten Hoefler, Nezihe Merve Gürel

TL;DR

The paper tackles the problem of selecting the best large language model (LLM) for a given task when annotation resources are scarce. It introduces LLM Selector, a judge-based active data-collection framework that incrementally annotates at most $b$ queries, chosen to maximize the mutual information about the best model in a pool of $m$ candidates, without requiring access to internal model parameters. The approach uses a two-parameter noise model and an ensemble of weak $k$-gram judges to generate noisy annotations; it maintains a posterior over the best model and selects queries sequentially to minimize the conditional entropy of that posterior. Empirically, LLM Selector reduces annotation costs by up to $59.62\%$ across six benchmarks and 151 LLMs while still identifying the best or a near-best model, which supports its use in black-box, API-based deployment scenarios.

Abstract

We introduce LLM SELECTOR, the first framework for active model selection of Large Language Models (LLMs). Unlike prior evaluation and benchmarking approaches that rely on fully annotated datasets, LLM SELECTOR efficiently identifies the best LLM with limited annotations. In particular, for any given task, LLM SELECTOR adaptively selects a small set of queries to annotate that are most informative about the best model for the task. To further reduce annotation cost, we leverage a judge-based oracle annotation model. Through extensive experiments on 6 benchmarks with 151 LLMs, we show that LLM SELECTOR reduces annotation costs by up to 59.62% when selecting the best and near-best LLM for the task.
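
The selection loop summarized in the TL;DR (maintain a posterior over the best model, annotate the query whose label is expected to reduce its entropy the most) can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes discrete labels instead of judge comparisons, and it replaces the two-parameter noise model with a single hypothetical noise rate `eps`. All function and parameter names (`likelihood`, `expected_posterior_entropy`, `select_and_update`, `oracle`) are illustrative assumptions.

```python
import numpy as np


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


def likelihood(preds_i, num_labels, eps):
    """lik[y, j] = P(annotation = y | model j is the best model).

    Assumed single-parameter noise model: the best model matches the
    annotation with probability 1 - eps, otherwise the annotation is
    uniform over the remaining labels.
    """
    m = preds_i.shape[0]
    lik = np.full((num_labels, m), eps / (num_labels - 1))
    lik[preds_i, np.arange(m)] = 1.0 - eps
    return lik


def expected_posterior_entropy(posterior, preds_i, num_labels, eps):
    """Expected entropy of the posterior after annotating one query."""
    joint = likelihood(preds_i, num_labels, eps) * posterior  # P(y, best = j)
    p_y = joint.sum(axis=1)
    return sum(
        p_y[y] * entropy(joint[y] / p_y[y])
        for y in range(num_labels)
        if p_y[y] > 0
    )


def select_and_update(preds, oracle, budget, num_labels, eps=0.2):
    """Greedy active selection over an (m models x n queries) prediction matrix.

    `oracle(i)` is a hypothetical callback returning the annotation of query i.
    Returns the final posterior over which model is best.
    """
    m, n = preds.shape
    posterior = np.full(m, 1.0 / m)  # uniform prior over the m candidates
    unlabeled = list(range(n))
    for _ in range(min(budget, n)):
        scores = [
            expected_posterior_entropy(posterior, preds[:, i], num_labels, eps)
            for i in unlabeled
        ]
        # Minimizing expected conditional entropy is equivalent to
        # maximizing mutual information between the label and the best model.
        i = unlabeled.pop(int(np.argmin(scores)))
        y = oracle(i)
        posterior = posterior * likelihood(preds[:, i], num_labels, eps)[y]
        posterior /= posterior.sum()
    return posterior
```

In this sketch, queries on which all candidates agree leave the posterior unchanged and therefore score no better than the current entropy, so the annotation budget is spent on queries where the models disagree.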

Paper Structure

This paper contains 19 sections, 11 equations, 3 figures, 3 tables, 1 algorithm.

Figures (3)

  • Figure 1: An overview of LLM Selector. For an arbitrary pool of $n$ queries and a set of candidate language models, LLM Selector adaptively annotates the $b \ll n$ most informative queries for identifying the best language model for the pool.
  • Figure 2: Candidate LLM win rate histograms.
  • Figure 3: Best model identification probability of LLM Selector and the baselines.