Smoothie: Label Free Language Model Routing

Neel Guha, Mayee F. Chen, Trevor Chow, Ishan S. Khare, Christopher Ré

TL;DR

Smoothie tackles the problem of unsupervised routing among a pool of LLMs for diverse tasks by treating per-sample LLM quality as a latent variable in a weak supervision–inspired Gaussian graphical model. It defines observable embeddings for each LLM output and a latent true output, deriving a closed-form estimator for per-sample quality scores $\theta_i(x)$ from embedding distances; a local variant uses kernel smoothing over nearest neighbors to make the estimates sample-dependent. Routing selects the LLM with the highest $\theta_i(x)$, yielding two practical instantiations: Smoothie-Global (uses all test data for global estimates) and Smoothie-Local (uses neighborhood-based, sample-conditioned estimates). Empirically, Smoothie-Global shows strong correlation with ground-truth model quality ($\rho$ ...

Abstract

Large language models (LLMs) are increasingly used in applications where LLM inputs may span many different tasks. Recent work has found that the choice of LLM is consequential, and different LLMs may be good for different input samples. Prior approaches have thus explored how engineers might select an LLM to use for each sample (i.e. routing). While existing routing methods mostly require training auxiliary models on human-annotated data, our work explores whether it is possible to perform unsupervised routing. We propose Smoothie, a weak supervision-inspired routing approach that requires no labeled data. Given a set of outputs from different LLMs, Smoothie constructs a latent variable graphical model over embedding representations of observable LLM outputs and unknown "true" outputs. Using this graphical model, we estimate sample-dependent quality scores for each LLM, and route each sample to the LLM with the highest corresponding score. We find that Smoothie's LLM quality-scores correlate with ground-truth model quality (correctly identifying the optimal model on 9/14 tasks), and that Smoothie outperforms baselines for routing by up to 10 points accuracy.
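The routing idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and it rests on two assumptions that do not appear in the text above: (1) each output embedding deviates from the unknown true embedding as $z_i - z^\star \sim \mathcal{N}(0, I/(2\theta_i))$, independently across LLMs, so that $\mathbb{E}\|z_i - z_j\|^2 = \mathbb{E}\|z_i - z^\star\|^2 + \mathbb{E}\|z_j - z^\star\|^2$; and (2) the local variant uses a uniform kernel over the `n0` nearest neighbors in input-embedding space, excluding the sample itself. All function names are hypothetical, and the data is synthetic.

```python
import numpy as np

def pairwise_sq_dists(Z):
    """Z: (n, m, d) embeddings of m LLM outputs for n samples.
    Returns (n, m, m) squared Euclidean distances between outputs."""
    diff = Z[:, :, None, :] - Z[:, None, :, :]
    return np.sum(diff ** 2, axis=-1)

def triplet_quality(delta, d):
    """delta: (m, m) averaged squared distances between LLM outputs.
    Solves the assumed identity delta_ij = err_i + err_j for each err_i
    using every pair (j, k) of other LLMs, then maps err_i to a score
    theta_i = d / (2 * err_i) under the assumed Gaussian model."""
    m = delta.shape[0]
    if m < 3:
        raise ValueError("need at least three LLMs to form a triplet")
    theta = np.empty(m)
    for i in range(m):
        others = [j for j in range(m) if j != i]
        est = [0.5 * (delta[i, j] + delta[i, k] - delta[j, k])
               for a, j in enumerate(others) for k in others[a + 1:]]
        err = max(float(np.mean(est)), 1e-12)  # guard against noisy negatives
        theta[i] = d / (2.0 * err)
    return theta

def quality_global(Z):
    """One score per LLM, estimated from all samples. Returns (m,)."""
    d = Z.shape[2]
    return triplet_quality(pairwise_sq_dists(Z).mean(axis=0), d)

def quality_local(Z, X, n0):
    """Per-sample scores from the n0 nearest neighbors of each input.
    X: (n, p) input embeddings. Returns (n, m)."""
    n, _, d = Z.shape
    D = pairwise_sq_dists(Z)
    x_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(x_dist, np.inf)  # exclude the sample itself (assumption)
    nbrs = np.argsort(x_dist, axis=1)[:, :n0]
    return np.stack([triplet_quality(D[nbrs[t]].mean(axis=0), d)
                     for t in range(n)])

# Synthetic check: three "LLMs" whose outputs are the true embedding plus
# noise of increasing scale, so LLM 0 is the best by construction.
rng = np.random.default_rng(0)
n, m, d = 400, 3, 16
z_star = rng.normal(size=(n, 1, d))
scales = np.array([0.3, 0.6, 1.0]).reshape(1, m, 1)
Z = z_star + scales * rng.normal(size=(n, m, d))
X = rng.normal(size=(n, 8))  # toy input embeddings

theta_global = quality_global(Z)
theta_local = quality_local(Z, X, n0=50)
route_global = int(np.argmax(theta_global))   # one LLM for every sample
routes_local = np.argmax(theta_local, axis=1)  # one LLM per sample
```

In this toy setup quality does not actually vary with the input, so the global and local routes should mostly agree; the local variant only pays off when different LLMs are better on different regions of the input space.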

Paper Structure

This paper contains 31 sections, 1 theorem, 7 equations, 8 figures, 11 tables, and 1 algorithm.

Key Result

Proposition 1

(Shin et al., 2022) For any $i, j \in [m]$, it follows from the graphical model that

Figures (8)

  • Figure 1: For a given input $x$, Smoothie estimates the quality of every LLM ensemble's generation, and uses this quality weight to route $x$ to a single LLM.
  • Figure 2: (a) Spearman's rank correlation coefficient between Smoothie-Global weights and ground-truth LLM performance for 3B and 7B ensembles across NLG tasks. (b) Smoothie-Global's improvement over Random by win-rate on AlpacaEval. (c) Smoothie-Global's improvement over Random by length-controlled win-rate on AlpacaEval.
  • Figure 3: On Distr-Acc and Distr-Rouge2, we measure how frequently Smoothie-Local selects the $i$-th best generation across the ensemble, for both the 3B and 7B ensembles.
  • Figure 4: We compare Random (blue) and Smoothie-Global (orange) for prompt selection on different-sized models in the Pythia suite. The x-axis denotes model size, and the y-axis denotes performance (either ROUGE-2 or accuracy).
  • Figure 5: We measure how Smoothie-Local's performance on Distr-Acc changes as $n_0$ changes.
  • ...and 3 more figures

Theorems & Definitions (1)

  • Proposition 1