Beyond GPT-5: Making LLMs Cheaper and Better via Performance-Efficiency Optimized Routing

Yiqun Zhang, Hao Li, Jianhao Chen, Hangfan Zhang, Peng Ye, Lei Bai, Shuyue Hu

TL;DR

This work tackles the persistent trade-off between performance and cost in large language models by introducing Avengers-Pro, a test-time routing framework that ensembles eight heterogeneous models. Through embedding, clustering into $k=60$ clusters, and a tunable cluster-wise score $x_j^i = \alpha \, \tilde{p}_j^i + (1-\alpha) \, (1-\tilde{q}_j^i)$, the system dynamically routes each query to the most suitable model, achieving a Pareto frontier across six benchmarks. Empirical results show Avengers-Pro surpasses the strongest single model GPT-5-medium by up to 7% in average accuracy at similar cost, and can match its performance with ~27% cost reduction or reach ~90% of that performance at ~63% cost reduction, illustrating substantial practical gains in efficiency. The approach is simple, training-free, and easily extensible to new models, enabling scalable, cost-aware deployment of multi-model LLM systems; code is available at the provided GitHub repository.
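The routing rule summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that $\tilde{p}_j^i$ and $\tilde{q}_j^i$ are min-max normalized accuracy and cost of model $i$ within cluster $j$, and that a query is assigned to the cluster with the nearest centroid. The function names (`min_max`, `nearest_cluster`, `cluster_scores`, `route`) and the toy numbers are hypothetical.

```python
from typing import List, Sequence


def min_max(values: Sequence[float]) -> List[float]:
    """Min-max normalize to [0, 1]; a constant input maps to all zeros.
    Assumption: this stands in for the tilde normalization in the score."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def nearest_cluster(embedding: Sequence[float],
                    centroids: Sequence[Sequence[float]]) -> int:
    """Index of the centroid closest to the query embedding
    (squared Euclidean distance; an assumed assignment rule)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)),
               key=lambda j: sq_dist(embedding, centroids[j]))


def cluster_scores(accuracy: Sequence[float], cost: Sequence[float],
                   alpha: float) -> List[float]:
    """x^i = alpha * p~^i + (1 - alpha) * (1 - q~^i) for every model i
    in one cluster, given per-model accuracy and cost in that cluster."""
    p = min_max(accuracy)
    q = min_max(cost)
    return [alpha * pi + (1 - alpha) * (1 - qi) for pi, qi in zip(p, q)]


def route(accuracy: Sequence[float], cost: Sequence[float],
          alpha: float) -> int:
    """Index of the model with the highest score in the cluster."""
    scores = cluster_scores(accuracy, cost, alpha)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy values for three hypothetical models in one cluster (not from the paper).
toy_accuracy = [0.9, 0.7, 0.6]
toy_cost = [10.0, 2.0, 1.0]
```

With these toy values, `alpha = 1.0` selects the most accurate model (index 0), `alpha = 0.0` selects the cheapest (index 2), and `alpha = 0.5` selects the middle model (index 1), which is the kind of shift the trade-off parameter is meant to produce.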

Abstract

Balancing performance and efficiency is a central challenge in large language model (LLM) advancement. GPT-5 addresses this with test-time routing, dynamically assigning queries to either an efficient or a high-capacity model during inference. In this work, we present Avengers-Pro, a test-time routing framework that ensembles LLMs of varying capacities and efficiencies, providing a unified solution for all performance-efficiency trade-offs. Avengers-Pro embeds and clusters incoming queries, then routes each to the most suitable model based on a performance-efficiency score. Across 6 challenging benchmarks and 8 leading models (including GPT-5-medium, Gemini-2.5-pro, and Claude-opus-4.1), Avengers-Pro achieves state-of-the-art results: by varying a performance-efficiency trade-off parameter, it can surpass the strongest single model (GPT-5-medium) by 7% in average accuracy. Moreover, it can match the average accuracy of the strongest single model at 27% lower cost, and reach ~90% of that performance at 63% lower cost. Finally, it achieves a Pareto frontier, consistently yielding the highest accuracy for any given cost, and the lowest cost for any given accuracy, among all single models. Code is available at https://github.com/ZhangYiqun018/AvengersPro.
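To make the Pareto-frontier claim concrete, the sketch below shows one way to extract the non-dominated points from a set of (cost, accuracy) pairs: a point is kept only if no cheaper-or-equal point reaches at least its accuracy. The function name `pareto_frontier` and the sample points are hypothetical and do not come from the paper's results.

```python
from typing import List, Tuple


def pareto_frontier(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return the non-dominated (cost, accuracy) points, sorted by cost.

    Points are scanned from cheapest to most expensive (ties broken by
    higher accuracy first); a point joins the frontier only if it is
    strictly more accurate than every point kept so far.
    """
    frontier: List[Tuple[float, float]] = []
    best_accuracy = float("-inf")
    for cost, accuracy in sorted(points, key=lambda p: (p[0], -p[1])):
        if accuracy > best_accuracy:
            frontier.append((cost, accuracy))
            best_accuracy = accuracy
    return frontier


# Hypothetical (cost, accuracy) pairs, for illustration only.
toy_points = [(1.0, 0.5), (2.0, 0.4), (3.0, 0.8), (3.0, 0.7)]
toy_frontier = pareto_frontier(toy_points)
```

Here `(2.0, 0.4)` is dropped because `(1.0, 0.5)` is both cheaper and more accurate, and `(3.0, 0.7)` is dropped because `(3.0, 0.8)` costs the same and is more accurate.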

Paper Structure

This paper contains 14 sections, 2 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Avengers-Pro, a unified framework for dynamic routing, optimizes performance-efficiency trade-offs by intelligently ensembling language models.
  • Figure 2: Effects of the trade-off parameter $\alpha$ on performance and efficiency. A greater value of $\alpha$ prioritizes performance over efficiency. An increase in performance is usually accompanied by an increase in cost.
  • Figure 3: Proportion of model usage for various values of the trade-off parameter $\alpha$. When $\alpha$ is low, Avengers-Pro tends to route queries to Qwen3 and Qwen3-thinking. With a greater value of $\alpha$, Avengers-Pro favors GPT-5-medium and Qwen3-thinking.