When to Reason: Semantic Router for vLLM

Chen Wang, Xunzhuo Liu, Yuhan Liu, Yue Zhu, Xiangxi Mo, Junchen Jiang, Huamin Chen

TL;DR

The paper addresses the cost-efficiency trade-off of reasoning in LLMs by introducing an intent-aware semantic router that selectively enables reasoning. The approach encodes prompts into semantic embeddings, classifies intent, and routes queries to lightweight or reasoning-enabled pathways, enabling adaptive reasoning within open-source serving stacks. Key contributions include identifying the need for semantic routing in open-source engines, implementing an open-source router that integrates with vLLM and Envoy/ext_proc, and demonstrating substantial gains on MMLU-Pro: an accuracy improvement of $10.24$ percentage points with latency reduced by $47.1\%$ and token usage by $48.5\%$ across 14 domains (statistically significant, $p<0.01$). This work offers a practical path to balance accuracy and efficiency in real-world LLM serving, particularly for knowledge-intensive tasks, while outlining areas for improvement in math and technical domains.

Abstract

Large Language Models (LLMs) demonstrate substantial accuracy gains when augmented with reasoning modes such as chain-of-thought and inference-time scaling. However, reasoning also incurs significant costs in inference latency and token usage, with environmental and financial impacts, which are unnecessary for many simple prompts. We present a semantic router that classifies queries based on their reasoning requirements and selectively applies reasoning only when beneficial. Our approach achieves a 10.2 percentage point improvement in accuracy on the MMLU-Pro benchmark while reducing response latency by 47.1% and token consumption by 48.5% compared to direct inference with vLLM. These results demonstrate that semantic routing offers an effective mechanism for striking a balance between accuracy and efficiency in open-source LLM serving systems.
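The routing flow summarized above (embed the prompt, classify its intent, then pick a lightweight or reasoning-enabled pathway) can be illustrated with a small sketch. This is not the authors' implementation: the bag-of-words "embedding", the example prompts, the `REASONING_INTENTS` policy, the `threshold` value, and all function names are assumptions made for illustration only. A real router would presumably use a learned embedding model and a trained intent classifier.

```python
import math
import re
from collections import Counter

# Hypothetical labeled prompts standing in for a trained intent classifier.
EXAMPLES = {
    "math": [
        "solve the integral of x squared",
        "prove the equation has two roots",
        "compute the probability of two dice",
    ],
    "history": [
        "who was the first emperor of rome",
        "when did the french revolution begin",
    ],
}

# Assumed policy: intents for which reasoning mode is switched on.
REASONING_INTENTS = {"math"}


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real semantic embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(texts):
    """Sum the example vectors of one intent into a single prototype."""
    total = Counter()
    for text in texts:
        total.update(embed(text))
    return total


CENTROIDS = {intent: centroid(texts) for intent, texts in EXAMPLES.items()}


def route(prompt, threshold=0.1):
    """Classify the prompt and decide whether to enable reasoning.

    Assumption: low-confidence prompts fall back to the lightweight path.
    """
    vec = embed(prompt)
    scores = {intent: cosine(vec, c) for intent, c in CENTROIDS.items()}
    best = max(scores, key=scores.get)
    confident = scores[best] >= threshold
    return {
        "intent": best if confident else "unknown",
        "score": scores[best],
        "reasoning": confident and best in REASONING_INTENTS,
    }


if __name__ == "__main__":
    for prompt in ("compute the integral of x cubed",
                   "when did the roman empire begin"):
        print(prompt, "->", route(prompt))
```

In this sketch the decision is a single boolean that a serving layer could translate into whichever mechanism its backend uses to turn reasoning on or off. The fallback to the lightweight path for unrecognized prompts is one possible design choice; falling back to reasoning instead would trade cost for caution.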

Paper Structure

This paper contains 13 sections, 5 figures, and 1 table.

Figures (5)

  • Figure 1: Overview of the proposed intent-aware semantic router. (a) Workflow of classification and routing; (b) system architecture.
  • Figure 2: Per-category accuracy across 14 MMLU-Pro domains for direct vLLM modes and our semantic router.
  • Figure 3: Per-category accuracy across all inference modes on MMLU-Pro.
  • Figure 4: Per-category average total tokens across all inference modes on MMLU-Pro. The semantic router consistently achieves the lowest token usage, reducing overhead in knowledge-centric domains (e.g., history, law, health) while remaining competitive in reasoning-heavy areas such as math and physics.
  • Figure 5: Per-category average response latency across all inference modes on MMLU-Pro. The semantic router reduces latency substantially compared to direct vLLM modes, particularly in domains with shorter factual queries (e.g., history, philosophy). Even in complex reasoning categories, the router sustains lower response times by avoiding unnecessary reasoning overhead.