ORI: O Routing Intelligence
Ahmad Shadid, Rahul Kumar, Mohit Mayank
TL;DR
ORI addresses task diversity and compute cost in large language model deployments with a dynamic routing framework that uses vector-space embeddings and clustering to send each query to the most suitable LLM in a pool. It formalizes the problem as a routing optimization, $\max \sum_{i=1}^{n} P(p_i, m_j) R(p_i, m_j)$ subject to the constraint $\sum_{j=1}^{k} R(p_i, m_j)=1$, so that each prompt $p_i$ is routed to exactly one model $m_j$, and it defines $\text{Score}(m_j,b_k)$ to evaluate the performance of model $m_j$ on benchmark $b_k$. The approach merges benchmark data, embeds prompts with a Sentence Transformer into 384-dimensional vectors, and clusters those embeddings to guide routing: each new prompt is assigned to the dominant benchmark within its cluster. Empirically, ORI achieves state-of-the-art or competitive results on MMLU, BBH, MuSR, and ARC while delivering favorable cost, speed, and latency trade-offs, demonstrating scalable, high-performance multi-LLM deployment without reliance on human preference data.
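The routing step summarized above can be illustrated with a minimal sketch. It is not the authors' implementation. Everything in it is an assumption made for illustration: the 2-D vectors stand in for 384-dimensional sentence embeddings, the centroids are assumed to come from an earlier clustering step, the model names and Score values are invented, and picking the highest-scoring model for the cluster's dominant benchmark is one plausible reading of how the benchmark label maps to a model.

```python
from collections import Counter
from math import dist

# Toy 2-D vectors stand in for 384-dimensional sentence embeddings (assumption).
# Each training prompt is tagged with the benchmark it was drawn from.
train = [
    ((0.9, 0.1), "MMLU"),
    ((0.8, 0.2), "MMLU"),
    ((0.7, 0.3), "ARC"),
    ((0.1, 0.9), "BBH"),
    ((0.2, 0.8), "BBH"),
    ((0.3, 0.9), "MuSR"),
]

# Hypothetical centroids, assumed to be the output of a prior clustering step.
centroids = [(0.8, 0.2), (0.2, 0.85)]

# Hypothetical Score(m_j, b_k) table; the values are invented, not results.
scores = {
    "model_a": {"MMLU": 0.70, "ARC": 0.65, "BBH": 0.40, "MuSR": 0.35},
    "model_b": {"MMLU": 0.60, "ARC": 0.60, "BBH": 0.55, "MuSR": 0.45},
}


def nearest_cluster(vec):
    """Return the index of the centroid closest to vec (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: dist(vec, centroids[c]))


def dominant_benchmarks():
    """Map each cluster to the benchmark that contributes most of its prompts."""
    counts = {c: Counter() for c in range(len(centroids))}
    for vec, bench in train:
        counts[nearest_cluster(vec)][bench] += 1
    return {c: cnt.most_common(1)[0][0] for c, cnt in counts.items()}


def route(vec):
    """Route one embedded prompt to exactly one model (R sums to 1 over models)."""
    bench = dominant_benchmarks()[nearest_cluster(vec)]
    model = max(scores, key=lambda m: scores[m][bench])
    return bench, model


print(route((0.85, 0.15)))
print(route((0.15, 0.90)))
```

In this toy setup, a prompt that lands in the first cluster inherits that cluster's dominant benchmark and is sent to whichever hypothetical model has the higher score on it. Returning a single model per prompt mirrors the constraint that the routing indicator sums to one over the pool.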
Abstract
A single large language model (LLM) often falls short when faced with an ever-growing range of tasks. We address this challenge by proposing ORI (O Routing Intelligence), a dynamic framework that leverages a set of LLMs. By intelligently routing incoming queries to the most suitable model, ORI improves task-specific accuracy while maintaining efficiency. Comprehensive evaluations across diverse benchmarks demonstrate consistent accuracy gains with controlled computational overhead: ORI outperforms the strongest individual models by up to 2.7 points on MMLU and 1.8 points on MuSR, and ties the top performance on ARC and BBH. These results underscore the benefits of a multi-model strategy and demonstrate how ORI's adaptive architecture can handle diverse tasks more effectively, offering a scalable, high-performance solution for systems of multiple large language models.
