When to Reason: Semantic Router for vLLM
Chen Wang, Xunzhuo Liu, Yuhan Liu, Yue Zhu, Xiangxi Mo, Junchen Jiang, Huamin Chen
TL;DR
The paper addresses the cost-efficiency trade-off of reasoning in LLMs by introducing an intent-aware semantic router that selectively enables reasoning. The approach encodes prompts into semantic embeddings, classifies intent, and routes queries to lightweight or reasoning-enabled pathways, enabling adaptive reasoning within open-source serving stacks. Key contributions include identifying the need for semantic routing in open-source engines, implementing an open-source router that integrates with vLLM and Envoy/ext_proc, and demonstrating substantial gains on MMLU-Pro: an accuracy improvement of $10.24$ percentage points with latency reduced by $47.1\%$ and token usage by $48.5\%$ across 14 domains (statistically significant, $p<0.01$). This work offers a practical path to balance accuracy and efficiency in real-world LLM serving, particularly for knowledge-intensive tasks, while outlining areas for improvement in math and technical domains.
Abstract
Large Language Models (LLMs) demonstrate substantial accuracy gains when augmented with reasoning modes such as chain-of-thought and inference-time scaling. However, reasoning also incurs significant costs in inference latency and token usage, with environmental and financial impacts, which are unnecessary for many simple prompts. We present a semantic router that classifies queries based on their reasoning requirements and selectively applies reasoning only when beneficial. Our approach achieves a 10.2 percentage point improvement in accuracy on the MMLU-Pro benchmark while reducing response latency by 47.1% and token consumption by 48.5% compared to direct inference with vLLM. These results demonstrate that semantic routing offers an effective mechanism for striking a balance between accuracy and efficiency in open-source LLM serving systems
