
ECVL-ROUTER: Scenario-Aware Routing for Vision-Language Models

Xin Tang, Youfang Han, Fangfei Gou, Wei Zhao, Xin Meng, Yang Yu, Jinguo Zhang, Yuanchun Shi, Yuntao Wang, Tengxiang Zhang

TL;DR

ECVL-ROUTER tackles the need for scenario-aware routing in vision–language systems by introducing a Minimal Expected Score ($MES$) to capture user requirements across fast response, high quality, and low energy/privacy scenarios. A transformer-based router selects between edge SVLMs and cloud LVLMs, guided by a Routing Comprehensive Score ($RCS$) that combines Average Problem-Solving Probability ($APSP$), Cost Advantage ($CA$), and Average Inference Latency ($AIL$). The framework is trained on a dedicated Response Score Dataset ($RSD$) with responses scored by an LLM-based judge and validated against human labels, and it is evaluated against multiple baselines showing substantial edge utilization with minimal quality loss and significant latency reductions. The work provides actionable guidance for deploying edge–cloud VLM systems, including $MES$-driven decision rules, a tunable threshold $\tau$, and open-source data and tooling for reproducibility and adaptation to domain-specific needs.

Abstract

Vision-Language Models (VLMs) excel in diverse multimodal tasks. However, user requirements vary across scenarios, which can be categorized into fast response, high-quality output, and low energy consumption. Relying solely on large models deployed in the cloud for all queries often leads to high latency and energy cost, while small models deployed on edge devices can handle simpler tasks with low latency and energy cost. To fully leverage the strengths of both large and small models, we propose ECVL-ROUTER, the first scenario-aware routing framework for VLMs. Our approach introduces a new routing strategy and evaluation metrics that dynamically select the appropriate model for each query based on user requirements, maximizing overall utility. We also construct a multimodal response-quality dataset tailored for router training and validate the approach through extensive experiments. Results show that our approach successfully routes over 80\% of queries to the small model while incurring a drop of less than 10\% in problem-solving probability.
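The routing idea summarized above can be illustrated with a small sketch. The summary does not spell out the exact decision rule, so the following is an assumption for illustration only: a hypothetical router outputs `p_small_ok`, an estimate of the probability that the edge model's response would score at least $MES$, and the query goes to the edge model when that estimate reaches the threshold $\tau$. The names `route`, `edge_fraction`, and `p_small_ok` are invented here and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RoutingDecision:
    target: str        # "edge" (small model) or "cloud" (large model)
    p_small_ok: float  # router's estimate that the small model meets MES


def route(p_small_ok: float, tau: float = 0.5) -> RoutingDecision:
    """Assumed rule: send the query to the edge model when the estimated
    probability of meeting MES is at least tau, otherwise to the cloud."""
    if not 0.0 <= p_small_ok <= 1.0:
        raise ValueError("p_small_ok must lie in [0, 1]")
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    target = "edge" if p_small_ok >= tau else "cloud"
    return RoutingDecision(target=target, p_small_ok=p_small_ok)


def edge_fraction(probs: Sequence[float], tau: float) -> float:
    """Fraction of queries that the assumed rule routes to the edge model."""
    if not probs:
        raise ValueError("probs must be non-empty")
    decisions = [route(p, tau) for p in probs]
    return sum(d.target == "edge" for d in decisions) / len(decisions)


if __name__ == "__main__":
    # Made-up router outputs for four queries, purely for illustration.
    probs = [0.9, 0.6, 0.2, 0.4]
    for tau in (0.3, 0.5, 0.7):
        print(f"tau={tau}: edge fraction={edge_fraction(probs, tau):.2f}")
```

Under this assumed rule, lowering $\tau$ sends more queries to the edge model (lower latency and cost, at some risk to quality), while raising it sends more to the cloud model. This is the kind of trade-off a tunable threshold is meant to expose.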

Paper Structure

This paper contains 69 sections, 8 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: (a) Overall Structure and (b) Training Strategy of ECVL-ROUTER
  • Figure 2: Impact of the decision threshold $\tau$. It illustrates how the performance metrics for each model router (ECVL-ROUTER, MLP, GBDT, MF) change with different values of $\tau$ for the InternVL-38B/1B pair at MES=6.
  • Figure 3: Overall score histogram across all models and datasets in RSD ($\sim$22k pairs). Dashed lines mark the mean ($\approx 5.58$) and median ($\approx 6.00$).
  • Figure 4: Score breakdown in RSD: (a) difficulty by dataset; (b) performance by model across the same instances.
  • Figure 5: Overall latency histogram (left) and percentile curve (right) across all $\sim$22k model–instance pairs. Dashed lines mark the mean ($\approx$1.31s) and median ($\approx$0.60s); tail extends beyond 5s at P99.
  • ...and 8 more figures