Scalable Prompt Routing via Fine-Grained Latent Task Discovery

Yunyi Zhang, Soji Adeshina, Sheng Guan, Ashwin Ganesh, Zhen Han, Vassilis N. Ioannidis, Huzefa Rangwala, George Karypis

Abstract

Prompt routing dynamically selects the most appropriate large language model from a pool of candidates for each query, optimizing performance while managing costs. As model pools scale to include dozens of frontier models with narrow performance gaps, existing approaches face significant challenges: manually defined task taxonomies cannot capture fine-grained capability distinctions, while monolithic routers struggle to differentiate subtle differences across diverse tasks. We propose a two-stage routing architecture that addresses these limitations through automated fine-grained task discovery and task-aware quality estimation. Our first stage employs graph-based clustering to discover latent task types and trains a classifier to assign prompts to discovered tasks. The second stage uses a mixture-of-experts architecture with task-specific prediction heads for specialized quality estimates. At inference, we aggregate predictions from both stages to balance task-level stability with prompt-specific adaptability. Evaluated on 10 benchmarks with 11 frontier models, our method consistently outperforms existing baselines and surpasses the strongest individual model while incurring less than half its cost.

Paper Structure (21 sections, 6 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Overview of FineRouter. Top: Offline task type discovery via graph-based clustering, producing fine-grained tasks with candidate LLMs per task. Bottom: Online inference where the task classifier assigns prompts to discovered tasks, enabling task-specific adapter activation in the MoE router. Final model selection aggregates task-level scores (Stage 1) with prompt-specific quality predictions (Stage 2).
  • Figure 2: (a) Routing distribution of FineRouter across 11 candidate models. (b) Cost-performance comparison. FineRouter (curve) outperforms baseline routers (triangles) and individual LLMs (circles).
  • Figure 3: Prompt given to the LLM to generate task descriptions for offline task type discovery (Sect. \ref{sec:task-type}).
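
The final selection step in the Figure 1 caption, which combines task-level scores (Stage 1) with prompt-specific quality predictions (Stage 2), can be illustrated with a small sketch. The paper's actual aggregation rule is not given in this excerpt, so the convex blend below, the weight `alpha`, the function names, and the model names are all assumptions made for illustration only.

```python
from typing import Dict, Iterable, Optional


def aggregate_scores(
    task_scores: Dict[str, float],
    prompt_scores: Dict[str, float],
    alpha: float = 0.5,
    candidates: Optional[Iterable[str]] = None,
) -> Dict[str, float]:
    """Blend Stage 1 and Stage 2 scores per model.

    ASSUMPTION: a convex combination with weight `alpha` on the
    task-level score. The source does not specify the real formula.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Only models scored by both stages can be compared.
    models = set(task_scores) & set(prompt_scores)
    # Optionally restrict to the candidate LLMs attached to the task.
    if candidates is not None:
        models &= set(candidates)
    if not models:
        raise ValueError("no model has scores from both stages")
    return {
        m: alpha * task_scores[m] + (1.0 - alpha) * prompt_scores[m]
        for m in models
    }


def select_model(
    task_scores: Dict[str, float],
    prompt_scores: Dict[str, float],
    alpha: float = 0.5,
    candidates: Optional[Iterable[str]] = None,
) -> str:
    """Return the model with the highest blended score."""
    blended = aggregate_scores(task_scores, prompt_scores, alpha, candidates)
    # Sort names first so ties resolve deterministically.
    return max(sorted(blended), key=blended.get)


# Hypothetical scores for three placeholder models.
task_scores = {"model_a": 0.80, "model_b": 0.70, "model_c": 0.90}
prompt_scores = {"model_a": 0.60, "model_b": 0.95, "model_c": 0.50}

choice = select_model(task_scores, prompt_scores, alpha=0.5)
```

In this sketch, a larger `alpha` leans on the stable task-level estimate, while a smaller `alpha` lets the prompt-specific prediction dominate. This mirrors the stability versus adaptability trade-off the abstract describes, without claiming to reproduce the authors' method.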