Beyond GPT-5: Making LLMs Cheaper and Better via Performance-Efficiency Optimized Routing
Yiqun Zhang, Hao Li, Jianhao Chen, Hangfan Zhang, Peng Ye, Lei Bai, Shuyue Hu
TL;DR
This work tackles the trade-off between performance and cost in large language models by introducing Avengers-Pro, a test-time routing framework that ensembles eight heterogeneous models. Queries are embedded and grouped into $k=60$ clusters, and each model $i$ receives a tunable score on each cluster $j$, $x_j^i = \alpha \, \tilde{p}_j^i + (1-\alpha) \, (1-\tilde{q}_j^i)$, which weighs a performance term $\tilde{p}_j^i$ against a cost term $\tilde{q}_j^i$ through the parameter $\alpha$. The system routes each query to the most suitable model under this score and achieves a Pareto frontier across six benchmarks. Empirically, Avengers-Pro surpasses the strongest single model, GPT-5-medium, by 7% in average accuracy at similar cost; it can also match that model's average accuracy at ~27% lower cost, or reach ~90% of it at ~63% lower cost. The approach is simple, training-free, and easily extensible to new models, enabling cost-aware deployment of multi-model LLM systems; code is available at the provided GitHub repository.
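The cluster-wise score can be illustrated with a minimal Python sketch. It assumes that $\tilde{p}_j^i$ and $\tilde{q}_j^i$ are performance and cost values already normalized to $[0, 1]$ for one cluster $j$ (the summary does not spell out the normalization), and the model names and numbers are hypothetical, chosen only to show how $\alpha$ shifts the choice.

```python
def cluster_scores(perf, cost, alpha):
    """Compute x = alpha * p + (1 - alpha) * (1 - q) for every model in one cluster.

    perf and cost map a model name to a value assumed to be normalized to [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return {m: alpha * perf[m] + (1.0 - alpha) * (1.0 - cost[m]) for m in perf}


def route(perf, cost, alpha):
    """Return the model with the highest score for this cluster."""
    scores = cluster_scores(perf, cost, alpha)
    return max(scores, key=scores.get)


# Hypothetical normalized values for a single cluster (not from the paper).
perf = {"large": 0.9, "small": 0.6}
cost = {"large": 1.0, "small": 0.1}

for alpha in (0.0, 0.5, 1.0):
    print(alpha, route(perf, cost, alpha))
```

With $\alpha = 1$ only performance counts, so the sketch picks the stronger model; with $\alpha = 0$ only cost counts, so it picks the cheaper one.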
Abstract
Balancing performance and efficiency is a central challenge in large language model (LLM) advancement. GPT-5 addresses this with test-time routing, dynamically assigning queries to either an efficient or a high-capacity model during inference. In this work, we present Avengers-Pro, a test-time routing framework that ensembles LLMs of varying capacities and efficiencies, providing a unified solution for all performance-efficiency tradeoffs. Avengers-Pro embeds and clusters incoming queries, then routes each to the most suitable model based on a performance-efficiency score. Across 6 challenging benchmarks and 8 leading models -- including GPT-5-medium, Gemini-2.5-pro, and Claude-opus-4.1 -- Avengers-Pro achieves state-of-the-art results: by varying a performance-efficiency trade-off parameter, it can surpass the strongest single model (GPT-5-medium) by 7% in average accuracy. Moreover, it can match the average accuracy of the strongest single model at 27% lower cost, and reach ~90% of that performance at 63% lower cost. Finally, it achieves a Pareto frontier, consistently yielding the highest accuracy for any given cost, and the lowest cost for any given accuracy, among all single models. Code is available at https://github.com/ZhangYiqun018/AvengersPro.
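The Pareto-frontier claim can be made concrete with a small sketch. The function below is a generic dominance check rather than the authors' evaluation code, and the (cost, accuracy) pairs are made-up illustrative values, not results from the paper. A configuration is on the frontier if no other configuration is at least as cheap and at least as accurate while being strictly better on one of the two.

```python
def pareto_front(points):
    """Return the names of the non-dominated (cost, accuracy) points.

    A point is dominated if some other point has cost <= and accuracy >=,
    with at least one of the two comparisons strict.
    """
    front = []
    for name, (cost, acc) in points.items():
        dominated = any(
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for other, (c, a) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Made-up (cost, accuracy) pairs for illustration only.
points = {
    "model_a": (10.0, 0.60),  # dominated by "router": costs more, scores less
    "model_b": (4.0, 0.55),
    "model_c": (6.0, 0.50),   # dominated by "model_b"
    "router": (8.0, 0.65),    # hypothetical routed configuration
}

print(pareto_front(points))
```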
