vLLM Semantic Router: Signal Driven Decision Routing for Mixture-of-Modality Models

Xunzhuo Liu, Huamin Chen, Samzong Lu, Yossi Ovadia, Guohong Wen, Zhengda Tan, Jintao Zhang, Senan Zedan, Yehudit Kerido, Liav Weiss, Bishen Yu, Asaad Balum, Noa Limoy, Abdallah Samara, Brent Salisbury, Hao Wu, Ryan Cook, Zhijie Wang, Qiping Pan, Rehan Khan, Avishek Goswami, Houston H. Zhang, Shuyi Wang, Ziang Tang, Fang Han, Zohaib Hassan, Jianqiao Zheng, Avinash Changrani

Abstract

As large language models (LLMs) diversify across modalities, capabilities, and cost profiles, the problem of intelligent request routing -- selecting the right model for each query at inference time -- has become a critical systems challenge. We present vLLM Semantic Router, a signal-driven decision routing framework for Mixture-of-Modality (MoM) model deployments. The central innovation is composable signal orchestration: the system extracts heterogeneous signal types from each request -- from sub-millisecond heuristic features (keyword patterns, language detection, context length, role-based authorization) to neural classifiers (domain, embedding similarity, factual grounding, modality) -- and composes them through configurable Boolean decision rules into deployment-specific routing policies. Different deployment scenarios -- multi-cloud enterprise, privacy-regulated, cost-optimized, latency-sensitive -- are expressed as different signal-decision configurations over the same architecture, without code changes. Matched decisions drive semantic model routing: more than a dozen selection algorithms analyze request characteristics to find the most cost-effective suitable model, while per-decision plugin chains enforce privacy and safety constraints (jailbreak detection, PII filtering, hallucination detection via the three-stage HaluGate pipeline). The system provides OpenAI API support for stateful multi-turn conversations, multi-endpoint and multi-provider routing across heterogeneous backends (vLLM, OpenAI, Anthropic, Azure, Bedrock, Gemini, Vertex AI), and a pluggable authorization factory supporting multiple auth providers. Deployed in production as an Envoy external processor, the architecture demonstrates that composable signal orchestration enables a single routing framework to serve diverse deployment scenarios with differentiated cost, privacy, and safety policies.

Paper Structure

This paper contains 104 sections, 2 theorems, 37 equations, 11 figures, 7 tables, 1 algorithm.

Key Result

Proposition 1

For any Boolean function $f: \{0,1\}^N \to \{0,1\}$ over signal match indicators, there exists a rule node $\phi$ using AND, OR, and NOT such that $\text{eval}(\phi, S(r)) = f(S(r))$ for all signal results $S(r)$.
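The usual way to see why a statement like Proposition 1 holds is the disjunctive-normal-form construction: build one AND term per input on which $f$ is 1, and OR the terms together. The following is a minimal sketch of that idea, not the paper's implementation; the class names `Leaf`, `Not`, `And`, `Or` and the helpers `eval_node` and `to_rule_node` are hypothetical, and $S(r)$ is simplified to a plain tuple of 0/1 match indicators.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Sequence, Tuple, Union


@dataclass(frozen=True)
class Leaf:
    index: int  # position of one signal match indicator in S(r)


@dataclass(frozen=True)
class Not:
    child: "Node"


@dataclass(frozen=True)
class And:
    children: Tuple["Node", ...]


@dataclass(frozen=True)
class Or:
    children: Tuple["Node", ...]


Node = Union[Leaf, Not, And, Or]


def eval_node(phi: Node, s: Sequence[int]) -> int:
    """Recursively evaluate a rule node against the indicator vector s."""
    if isinstance(phi, Leaf):
        return int(bool(s[phi.index]))
    if isinstance(phi, Not):
        return 1 - eval_node(phi.child, s)
    if isinstance(phi, And):
        return int(all(eval_node(c, s) for c in phi.children))
    if isinstance(phi, Or):
        return int(any(eval_node(c, s) for c in phi.children))
    raise TypeError(f"unknown node type: {type(phi).__name__}")


def to_rule_node(f: Callable[[Tuple[int, ...]], int], n: int) -> Node:
    """Build a DNF rule node that agrees with f on every input in {0,1}^n."""
    minterms = []
    for bits in product((0, 1), repeat=n):
        if f(bits):
            # This AND term is true on exactly one input, namely `bits`.
            literals = tuple(
                Leaf(i) if b else Not(Leaf(i)) for i, b in enumerate(bits)
            )
            minterms.append(And(literals))
    # An empty OR evaluates to 0, which covers the constant-false function.
    return Or(tuple(minterms))


# Exhaustive check on a small example: XOR over two signals.
xor = lambda bits: bits[0] ^ bits[1]
phi = to_rule_node(xor, 2)
assert all(eval_node(phi, s) == xor(s) for s in product((0, 1), repeat=2))
```

The construction can produce up to $2^N$ terms, so it demonstrates expressiveness rather than a compact encoding; hand-written rule nodes are normally far smaller.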

Figures (11)

  • Figure 1: Three-layer architecture with closed-loop feedback. A deployment configuration $\Gamma$ selects which signals, decisions, and plugins are active. Layer 1 extracts a signal vector $\mathbf{s}$ from the request. Layer 2 evaluates Boolean decision formulas to select $d^*$. Layer 3 executes the per-decision plugin chain, selects a model from $d^*$'s candidate set, and routes to the provider endpoint. Response-side signals feed back to enable adaptive routing.
  • Figure 2: Signal extraction taxonomy and evaluation flow. An incoming request is evaluated in parallel against heuristic signals (sub-millisecond, deterministic) and learned signals (neural inference via LoRA classifiers). Only signal types referenced by configured decisions are computed (demand-driven evaluation). Results merge into the structured signal result $S(r)$.
  • Figure 3: Rule-node expression trees at increasing depth. (a) A flat depth-1 tree: AND over three leaf conditions. (b) A NOR expression: $\textsc{not}(\textsc{or}(\text{cs},\text{math}))$, matching all non-STEM queries. (c) An XOR expression composed from AND, OR, and NOT primitives, routing requests that match exactly one of two signals. Leaf nodes (gray) reference signal conditions; composite nodes use AND (blue), OR (green), and NOT (orange).
  • Figure 4: Three-level correspondence between combinational logic circuits and the decision engine. Level 1: A PLA with AND-plane and OR-plane corresponds to a flat (depth-1) decision formula. Level 2: A general combinational circuit with arbitrarily nested AND, OR, and NOT gates corresponds to a recursive rule-node tree within a single decision. Level 3: An array of circuits with a priority encoder corresponds to the full decision set with priority-ordered evaluation, realizing any routing policy.
  • Figure 5: HaluGate three-stage gated pipeline. The Sentinel classifies queries on the request path; non-factual queries (40--60%) skip verification entirely (dashed). For factual queries, the Detector identifies hallucinated spans in the model response, and the Explainer provides NLI-based diagnostics per span.
  • ...and 6 more figures
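The flow described in the captions for Figures 1 to 4 (demand-driven signal extraction, Boolean rule evaluation, priority-ordered selection of $d^*$) can be sketched in a few lines. This is an illustration under stated assumptions, not the system's actual code or configuration format: the tuple encoding of rules, the toy keyword extractors, the decision names, and the convention that a larger number means higher priority are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical rule encoding:
#   ("signal", name) | ("not", rule) | ("and", rule, ...) | ("or", rule, ...)
Rule = tuple
Decision = Tuple[str, int, Rule]  # (name, priority, rule)


def referenced(rule: Rule) -> Set[str]:
    """Collect the signal names a rule actually refers to."""
    if rule[0] == "signal":
        return {rule[1]}
    names: Set[str] = set()
    for child in rule[1:]:
        names |= referenced(child)
    return names


def evaluate(rule: Rule, s: Dict[str, bool]) -> bool:
    """Evaluate a rule tree against the computed signal results."""
    op = rule[0]
    if op == "signal":
        return s[rule[1]]
    if op == "not":
        return not evaluate(rule[1], s)
    if op == "and":
        return all(evaluate(c, s) for c in rule[1:])
    if op == "or":
        return any(evaluate(c, s) for c in rule[1:])
    raise ValueError(f"unknown operator: {op}")


def route(
    request: str,
    extractors: Dict[str, Callable[[str], bool]],
    decisions: List[Decision],
) -> Optional[str]:
    """Return the highest-priority matching decision name, or None."""
    # Demand-driven evaluation: compute only signals some decision references.
    needed = set().union(*(referenced(rule) for _, _, rule in decisions))
    s = {name: extractors[name](request) for name in needed}
    matched = [(prio, name) for name, prio, rule in decisions if evaluate(rule, s)]
    return max(matched)[1] if matched else None


CALLS: List[str] = []  # records which extractors actually ran


def traced(name: str, fn: Callable[[str], bool]) -> Callable[[str], bool]:
    def wrapper(request: str) -> bool:
        CALLS.append(name)
        return fn(request)
    return wrapper


# Toy keyword heuristics standing in for real signal extractors.
EXTRACTORS = {
    "math": traced("math", lambda r: "integral" in r),
    "cs": traced("cs", lambda r: "code" in r),
    "pii": traced("pii", lambda r: "@" in r),  # configured but unreferenced
}

CS, MATH = ("signal", "cs"), ("signal", "math")
DECISIONS: List[Decision] = [
    # XOR built from AND, OR, and NOT, as in Figure 3(c).
    ("stem_exactly_one", 10, ("and", ("or", CS, MATH), ("not", ("and", CS, MATH)))),
    # NOR, as in Figure 3(b): matches all non-STEM queries.
    ("non_stem", 5, ("not", ("or", CS, MATH))),
]

assert route("compute the integral of x", EXTRACTORS, DECISIONS) == "stem_exactly_one"
assert route("hello there", EXTRACTORS, DECISIONS) == "non_stem"
assert "pii" not in CALLS  # never computed because no decision references it
```

A request that matches both `cs` and `math` satisfies neither rule here, so `route` returns `None`; a real deployment would presumably define a fallback decision for that case.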

Theorems & Definitions (12)

  • Definition 1: Deployment Configuration
  • Definition 2: Signal Rule
  • Definition 3: Signal Result
  • Definition 4: Decision
  • Definition 5: Rule Node (Boolean Expression Tree)
  • Proposition 1: Single-decision completeness
  • Proof sketch (Proposition 1)
  • Proposition 2: Routing policy completeness
  • Proof sketch (Proposition 2)
  • Definition 6: Fuzzy Rule-Node Evaluation
  • ...and 2 more