ConsRoute: Consistency-Aware Adaptive Query Routing for Cloud-Edge-Device Large Language Models

Haoyu Qiao, Hao Zhang, Shanwen Mao, Siyao Cheng, Jie Liu

Abstract

Large language models (LLMs) deliver impressive capabilities but incur substantial inference latency and cost, which hinders their deployment in latency-sensitive and resource-constrained scenarios. Cloud-edge-device collaborative inference has emerged as a promising paradigm by dynamically routing queries to models of different capacities across tiers. In this paper, we propose ConsRoute, a lightweight, semantic-aware, and adaptive routing framework that significantly improves inference efficiency while minimizing impact on response quality. Unlike prior routing methods that rely on predicting coarse-grained output quality gaps, ConsRoute leverages a reranker to directly assess the semantic consistency between responses generated by models at different tiers, yielding fine-grained soft supervision signals for routing. To minimize device-side overhead, ConsRoute reuses hidden states from the LLM prefilling stage as compact query representations, avoiding additional encoders or inference passes. Furthermore, these representations are clustered, and Bayesian optimization is employed to learn cluster-specific routing thresholds that dynamically balance quality, latency, and cost under heterogeneous query distributions. Extensive experiments demonstrate that ConsRoute achieves near-cloud performance (>=95%) while reducing end-to-end latency and inference cost by nearly 40%, consistently outperforming existing routing baselines in both response quality and system efficiency.
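The cluster-specific thresholding described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the two-way "local" versus "escalate" decision, the Euclidean nearest-centroid assignment, and the example centroids and thresholds are all assumptions made for illustration. In the paper, the query representation comes from prefill hidden states, the consistency score from a learned predictor, and the thresholds from Bayesian optimization; here they are simply passed in as plain numbers.

```python
import math


def nearest_cluster(embedding, centroids):
    """Return the index of the centroid closest to the embedding (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(embedding, centroids[i]))


def route(embedding, predicted_consistency, centroids, thresholds):
    """Hypothetical routing rule: answer locally only when the predicted
    consistency with the larger model clears the threshold of the query's cluster."""
    cluster = nearest_cluster(embedding, centroids)
    if predicted_consistency >= thresholds[cluster]:
        return "local"
    return "escalate"


# Illustrative values only: two clusters with different thresholds.
centroids = [[0.0, 0.0], [10.0, 10.0]]
thresholds = [0.5, 0.9]

decision_a = route([1.0, 1.0], 0.7, centroids, thresholds)  # falls in cluster 0
decision_b = route([9.0, 9.0], 0.7, centroids, thresholds)  # falls in cluster 1
```

The same predicted consistency score leads to different decisions in the two clusters, which is the point of learning a threshold per cluster rather than a single global one.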

Paper Structure (49 sections, 14 equations, 13 figures, 3 tables, 1 algorithm)

Figures (13)

  • Figure 1: Cloud-Edge-Device LLM Collaboration System
  • Figure 2: Overview of the ConsRoute framework. The bottom right shows the Semantic Representation Extractor, which leverages the DLM to extract input semantics (see the "Semantic Representation" section). The bottom left shows the Training Data Construction process for the predictor, and the top right shows the Lightweight Consistency Predictor, which decides which model tier a query should be routed to (see the "Lightweight Consistency Predictor" section). The top left presents the Adaptive Routing Policy, where appropriate routing thresholds are determined via Bayesian optimization (see the "Adaptive Threshold via Bayesian Optimization" section).
  • Figure 3: Prompt-guided representation learning. The user query is concatenated with a fixed instruction and an EOS token. The DLM processes the input, and the final-layer hidden state of the EOS token is used as a consistency-aware representation for routing.
  • Figure 4: Comparison of consistency prediction signals. The left and middle plots show the relationship between score differences (LLM vs. DLM) from a reward model (Qwen2.5-PRM-7B) and BartScore, respectively, and human-annotated consistency labels. The right plot shows the same analysis using a reranker model (Qwen3-reranker-4B). The reranker score exhibits a stronger linear correlation with human labels, suggesting it better reflects true semantic consistency between responses.
  • Figure 5: Example illustrating that the quality gap fails to reveal semantic inconsistency. The reward model gives similar scores to both responses, while the reranker identifies their semantic mismatch.
  • ...and 8 more figures