ECVL-ROUTER: Scenario-Aware Routing for Vision-Language Models

Xin Tang; Youfang Han; Fangfei Gou; Wei Zhao; Xin Meng; Yang Yu; Jinguo Zhang; Yuanchun Shi; Yuntao Wang; Tengxiang Zhang

TL;DR

ECVL-ROUTER addresses scenario-aware routing in vision–language systems. It introduces a Minimal Expected Score ($MES$) to capture user requirements across fast-response, high-quality, and low-energy/privacy scenarios. A transformer-based router selects between small VLMs (SVLMs) on edge devices and large VLMs (LVLMs) in the cloud, guided by a Routing Comprehensive Score ($RCS$) that combines Average Problem-Solving Probability ($APSP$), Cost Advantage ($CA$), and Average Inference Latency ($AIL$). The framework is trained on a dedicated Response Score Dataset ($RSD$) whose responses are scored by an LLM-based judge and validated against human labels. Evaluated against multiple baselines, it shows substantial edge utilization with minimal quality loss and significant latency reductions. The work also provides guidance for deploying edge–cloud VLM systems, including $MES$-driven decision rules, a tunable threshold $\tau$, and open-source data and tooling for reproducibility and adaptation to domain-specific needs.

Abstract

Vision-Language Models (VLMs) excel in diverse multimodal tasks. However, user requirements vary across scenarios, which can be categorized into fast response, high-quality output, and low energy consumption. Relying solely on large models deployed in the cloud for all queries often leads to high latency and energy cost, whereas small models deployed on edge devices can handle simpler tasks with low latency and energy cost. To leverage the strengths of both large and small models, we propose ECVL-ROUTER, the first scenario-aware routing framework for VLMs. Our approach introduces a new routing strategy and evaluation metrics that dynamically select the appropriate model for each query based on user requirements, maximizing overall utility. We also construct a multimodal response-quality dataset tailored for router training and validate the approach through extensive experiments. Results show that our approach routes over 80% of queries to the small model while incurring a drop of less than 10% in problem-solving probability.
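As an illustration of the per-query selection described above, the following is a minimal sketch of threshold-based routing. It does not reproduce the paper's router. It assumes, hypothetically, that a router has already produced for each query an estimated probability (`p_edge_ok`) that the small edge model's answer would meet the user's minimal expected score, and that a query goes to the edge model when this estimate reaches a threshold `tau`. The names `Query`, `p_edge_ok`, `route`, and `edge_fraction` are invented for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Query:
    text: str
    # Hypothetical router output (an assumption of this sketch): estimated
    # probability that the edge model's answer meets the user's requirement.
    p_edge_ok: float


def route(query: Query, tau: float) -> str:
    """Return "edge" if the estimate reaches tau, otherwise "cloud"."""
    return "edge" if query.p_edge_ok >= tau else "cloud"


def edge_fraction(queries: List[Query], tau: float) -> float:
    """Share of queries sent to the edge model under threshold tau."""
    if not queries:
        return 0.0
    return sum(route(q, tau) == "edge" for q in queries) / len(queries)
```

Under these assumptions, raising `tau` sends fewer queries to the edge model and lowering it sends more, which is the kind of trade-off a tunable threshold exposes.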
