LLM Routing with Dueling Feedback

Chao-Kai Chiang, Takashi Ishida, Masashi Sugiyama

TL;DR

This work addresses dynamic LLM routing under weak supervision by casting the problem as online contextual dueling bandits with pairwise preference feedback. It introduces Category-Calibrated Fine-Tuning (CCFT) to produce category-aligned model embeddings and couples them with Feel-Good Thompson Sampling for Contextual Dueling Bandits (FGTS.CDB), a theoretically grounded posterior-sampling learner. Four categorical weighting variants, including cost-aware and top-k focused schemes, are proposed to balance user satisfaction, model expertise, and inference cost. Empirical evaluation on RouterBench and MixInstruct demonstrates lower cumulative regret, faster convergence, robustness to distribution shifts, and favorable cost-performance trade-offs compared to strong baselines using generic embedding models, highlighting practical viability for adaptive, label-efficient LLM routing.

Abstract

We study LLM routing, the problem of selecting the best model for each query while balancing user satisfaction, model expertise, and inference cost. We formulate routing as contextual dueling bandits, learning from pairwise preference feedback rather than absolute scores, thereby yielding label-efficient and dynamic adaptation. Building on this formulation, we introduce Category-Calibrated Fine-Tuning (CCFT), a representation-learning method that derives model embeddings from offline data using contrastive fine-tuning with categorical weighting. These embeddings enable the practical instantiation of Feel-Good Thompson Sampling for Contextual Dueling Bandits (FGTS.CDB), a theoretically grounded posterior-sampling algorithm. We propose four variants of the categorical weighting that explicitly integrate model quality and cost, and we empirically evaluate the proposed methods on the RouterBench and MixInstruct datasets. Across both benchmarks, our methods achieve lower cumulative regret and faster convergence, with better robustness and performance-cost balance than strong baselines built with a general-purpose OpenAI embedding model.
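
The routing loop described above (pick two candidate models per query, observe which response is preferred, update) can be illustrated with a toy sketch. This is not the paper's FGTS.CDB algorithm or CCFT embeddings: it assumes a linear utility over a hypothetical joint feature (the elementwise product of query and model embeddings), a Bradley-Terry preference model, Gaussian perturbation of the parameters as a crude stand-in for posterior sampling, and a plain stochastic-gradient update. All names (`DuelingRouter`, `features`, `simulate`) and all numeric settings are illustrative assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def features(query, model_emb):
    # Hypothetical joint feature: elementwise product of the two embeddings.
    return [q * m for q, m in zip(query, model_emb)]


class DuelingRouter:
    """Toy contextual dueling bandit; a simplified stand-in, not FGTS.CDB."""

    def __init__(self, model_embs, lr=0.5, noise=1.0, seed=0):
        self.model_embs = model_embs
        self.theta = [0.0] * len(model_embs[0])
        self.lr = lr
        self.noise = noise
        self.rng = random.Random(seed)

    def _argmax(self, theta, query):
        scores = [dot(theta, features(query, m)) for m in self.model_embs]
        return max(range(len(scores)), key=scores.__getitem__)

    def route(self, query):
        # Greedy choice under the current estimate (no exploration).
        return self._argmax(self.theta, query)

    def select_pair(self, query):
        # Two independent noisy draws of theta, one per duelist.
        pair = []
        for _ in range(2):
            theta = [t + self.rng.gauss(0.0, self.noise) for t in self.theta]
            pair.append(self._argmax(theta, query))
        return pair[0], pair[1]

    def update(self, query, a, b, a_wins):
        # One gradient step on the Bradley-Terry log-likelihood.
        if a == b:
            return  # a self-duel carries no preference information
        fa = features(query, self.model_embs[a])
        fb = features(query, self.model_embs[b])
        diff = [x - y for x, y in zip(fa, fb)]
        p = sigmoid(dot(self.theta, diff))
        grad = (1.0 if a_wins else 0.0) - p
        self.theta = [t + self.lr * grad * d for t, d in zip(self.theta, diff)]


def simulate(rounds=2000, seed=0):
    # Synthetic setup: two query categories (one-hot embeddings) and three
    # models whose embeddings encode per-category strength (all made up).
    model_embs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    true_theta = [2.0, 2.0]
    env = random.Random(seed + 1)
    router = DuelingRouter(model_embs, seed=seed)
    for _ in range(rounds):
        query = [1.0, 0.0] if env.random() < 0.5 else [0.0, 1.0]
        a, b = router.select_pair(query)
        ua = dot(true_theta, features(query, model_embs[a]))
        ub = dot(true_theta, features(query, model_embs[b]))
        a_wins = env.random() < sigmoid(ua - ub)  # noisy pairwise preference
        router.update(query, a, b, a_wins)
    return router
```

Calling `simulate()` and then `route([1.0, 0.0])` or `route([0.0, 1.0])` on the returned router gives its greedy choice for each synthetic category. Only the binary outcome `a_wins` reaches the learner, never an absolute score, which is the sense in which the feedback is pairwise.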

Paper Structure

This paper contains 31 sections, 1 theorem, 8 equations, 15 figures, 3 tables, 1 algorithm.

Key Result

Proposition 1

Let $f_{km}$, $\{ \mathcal{Q}_m \}_{m=1}^{M}$, and $\{ \mathcal{G}_k \}_{k=1}^{K}$ be defined as above. Let $\mathbb{E}[Q_m]$ denote the expected embedding of queries in category $m$. Assume the embedding distribution within category $m$ is independent of label $k$. This is a reasonable assumption, fo...

Figures (15)

  • Figure 1: Failed versus successful examples.
  • Figure 2: OpenAItext results
  • Figure 3: e5b_E4 results
  • Figure 4: OpenAItext results
  • Figure 5: e5b_E4 results
  • ...and 10 more figures

Theorems & Definitions (2)

  • Proposition 1
  • Proof of Proposition 1