Guarded Query Routing for Large Language Models
Richard Šléher, William Brach, Tibor Sloboda, Kristián Košťál, Lukas Galke
TL;DR
This work reframes guarded query routing as a robust text-classification problem with out-of-distribution rejection, introducing GQR-Bench to benchmark routing strategies across law, finance, and healthcare while challenging LLM-only routing. It systematically compares large language models, guardrails, and efficient text classifiers (notably WideMLP and fastText) on ID/OOD performance and efficiency, revealing that lightweight classifiers can achieve near-LLM accuracy with orders-of-magnitude lower latency and storage. The key finding is that efficient classifiers, when paired with reliable OOD detection, offer substantial practical value for production routing, often matching or closely approaching the best LLM performance with far lower compute and cost. The work also exposes weaknesses in existing guardrail approaches for routing tasks and provides concrete guidance for deploying guarded routing systems in real-world settings, along with an open-source benchmark package.
Abstract
Query routing, the task of routing user queries to different large language model (LLM) endpoints, can be considered a text classification problem. However, out-of-distribution queries must be handled properly, as they could concern unrelated domains, be written in other languages, or even contain unsafe text. Here, we therefore study a guarded query routing problem, for which we first introduce the Guarded Query Routing Benchmark (GQR-Bench, released as the Python package gqr), which covers three exemplary target domains (law, finance, and healthcare) and seven datasets to test robustness against out-of-distribution queries. We then use GQR-Bench to contrast the effectiveness and efficiency of LLM-based routing mechanisms (GPT-4o-mini, Llama-3.2-3B, and Llama-3.1-8B), standard LLM-based guardrail approaches (LlamaGuard and NVIDIA NeMo Guardrails), continuous bag-of-words classifiers (WideMLP, fastText), and traditional machine learning models (SVM, XGBoost). Our results show that WideMLP, enhanced with out-of-domain detection capabilities, yields the best trade-off between accuracy (88%) and speed (<4ms). The embedding-based fastText excels at speed (<1ms) with acceptable accuracy (80%), whereas LLMs yield the highest accuracy (91%) but are comparatively slow (62ms for local Llama-3.1-8B and 669ms for remote GPT-4o-mini calls). Our findings challenge the automatic reliance on LLMs for (guarded) query routing and provide concrete recommendations for practical applications. Source code is available at https://github.com/williambrach/gqr.
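To make the problem setting concrete, the following is a minimal sketch of guarded routing as classification with a rejection option. It is not the paper's WideMLP or fastText implementation and does not use the gqr package API. The toy training sentences, the cosine-similarity scorer, the `route` function, and the threshold value of 0.4 are all hypothetical choices made for illustration; only the three domain labels come from the abstract.

```python
import math
from collections import Counter

# Toy in-distribution examples per target domain.
# Hypothetical data for illustration; not taken from GQR-Bench.
TRAIN = {
    "law": [
        "is this contract clause enforceable",
        "how do i file a lawsuit in court",
    ],
    "finance": [
        "what is the interest rate on a loan",
        "how should i invest in stocks and bonds",
    ],
    "healthcare": [
        "what are the symptoms of the flu",
        "which medicine treats a headache",
    ],
}


def tokenize(text):
    """Lowercase whitespace tokenization (deliberately simplistic)."""
    return text.lower().split()


# One bag-of-words count vector per domain.
CENTROIDS = {
    domain: Counter(tok for sent in sents for tok in tokenize(sent))
    for domain, sents in TRAIN.items()
}


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[tok] for tok, count in a.items() if tok in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def route(query, threshold=0.4):
    """Return the best-matching domain, or "reject" if no domain is
    similar enough. The threshold is an assumed value, not a tuned one."""
    q = Counter(tokenize(query))
    scores = {domain: cosine(q, c) for domain, c in CENTROIDS.items()}
    best = max(scores, key=scores.get)
    if scores[best] < threshold:
        return "reject"  # treated as out-of-distribution
    return best


if __name__ == "__main__":
    for query in [
        "what are the symptoms of a headache",
        "how should i invest in bonds",
        "recommend a good pizza place",
        "wie ist das wetter heute",
    ]:
        print(query, "->", route(query))
```

In this sketch, the guard is simply a confidence threshold on the top score: queries that resemble none of the target domains, including queries in another language, fall below it and are rejected rather than forwarded to a domain endpoint. A production router would replace the toy scorer with a trained classifier and calibrate the threshold on held-out in-distribution and out-of-distribution data.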
