edgeVLM: Cloud-edge Collaborative Real-time VLM based on Context Transfer

Chen Qian, Xinran Yu, Zewen Huang, Danyang Li, Qiang Ma, Fan Dang, Xuan Ding, Guangyong Shang, Zheng Yang

TL;DR

edgeVLM tackles real-time vision-language reasoning under cloud latency by introducing Context Transfer, reusing delayed LVLM outputs as semantic and visual priors for SVLM inference. The framework deploys two training-free modules, Context Replacement and Visual Focus, to refine textual history and steer attention toward salient image regions, respectively. Across multiple real-time tasks and datasets, edgeVLM consistently improves accuracy and robustness over traditional cloud-edge approaches while maintaining lower latency than LVLM-only schemes. This latency-aware collaboration paradigm promises more reliable real-time VLM systems in dynamic network conditions.

Abstract

Vision-Language Models (VLMs) are increasingly deployed in real-time applications such as autonomous driving and human-computer interaction, which demand fast and reliable responses based on accurate perception. To meet these requirements, existing systems commonly employ cloud-edge collaborative architectures, such as partitioned Large Vision-Language Models (LVLMs) or task offloading strategies between Large and Small Vision-Language Models (SVLMs). However, these methods fail to accommodate cloud latency fluctuations and overlook the full potential of delayed but accurate LVLM responses. In this work, we propose a novel cloud-edge collaborative paradigm for VLMs, termed Context Transfer, which treats the delayed outputs of LVLMs as historical context to provide real-time guidance for SVLM inference. Based on this paradigm, we design edgeVLM, which incorporates both context replacement and visual focus modules to refine historical textual input and enhance visual grounding consistency. Extensive experiments on three real-time vision-language reasoning tasks across four datasets demonstrate the effectiveness of the proposed framework. The new paradigm lays the groundwork for more effective and latency-aware collaboration strategies in future VLM systems.

Paper Structure

This paper contains 24 sections, 11 equations, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Overview of representative deployment and inference strategies for VLMs.
  • Figure 2: Overview of the proposed collaboration framework, edgeVLM. The system takes timestamped video streams as input and performs two parallel operations: uploading selected keyframes to the cloud-based LVLM for processing, and conducting local inference using the SVLM. Given potential cloud latency, delayed LVLM outputs are reused as historical context. These outputs guide the SVLM through two modules, Context Replacement and Visual Focus, to improve the quality of real-time predictions.
  • Figure 3: ROI-based region retrieval via cosine similarity. Given ROIs of a traffic cone and a pedestrian from frame $T - d$, cosine similarity highlights related regions in frame $T$ to guide the attention of the SVLM.
  • Figure 4: Experimental latency evaluation of cloud services over a 5G network.
  • Figure 5: Absolute performance gain of edgeVLM over SVLM across different settings. (a) shows the impact of $K_{delay}$ with $K_{call} = 0$. (b) shows the impact of $K_{call}$ with $K_{delay} = 4$.
  • ...and 2 more figures
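
The ROI-based retrieval described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `roi_similarity_map` and `top_k_patches`, the use of per-patch feature grids of shape (H, W, C), and the mean-pooling of ROI features into a single query vector are all assumptions made for illustration. The sketch only shows how cosine similarity between an ROI taken from frame $T - d$ and the patches of frame $T$ could produce a map of related regions.

```python
import numpy as np


def roi_similarity_map(prev_feats, roi_mask, curr_feats, eps=1e-8):
    """Cosine similarity between an ROI in frame T-d and every patch of frame T.

    prev_feats: (H, W, C) patch features of the delayed frame T-d (assumed layout).
    roi_mask:   (H, W) boolean mask marking the ROI patches in frame T-d.
    curr_feats: (H, W, C) patch features of the current frame T.
    Returns an (H, W) array of cosine similarities in [-1, 1].
    """
    if not roi_mask.any():
        raise ValueError("roi_mask selects no patches")

    # Assumption: the ROI is summarized by the mean of its patch features.
    roi_vec = prev_feats[roi_mask].mean(axis=0)
    roi_vec = roi_vec / (np.linalg.norm(roi_vec) + eps)

    # Normalize every patch of the current frame to unit length.
    norms = np.linalg.norm(curr_feats, axis=-1, keepdims=True)
    curr_unit = curr_feats / (norms + eps)

    # Dot product of unit vectors is the cosine similarity.
    return curr_unit @ roi_vec


def top_k_patches(sim_map, k):
    """Return the (row, col) indices of the k most similar patches."""
    flat = np.argsort(sim_map, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat, sim_map.shape)
    return list(zip(rows.tolist(), cols.tolist()))


if __name__ == "__main__":
    # Toy example: an object with a distinctive feature moves between frames.
    prev = np.zeros((4, 4, 3))
    prev[..., 0] = 1.0                 # background feature
    prev[1, 1] = [0.0, 1.0, 0.0]       # object at patch (1, 1) in frame T-d
    mask = np.zeros((4, 4), dtype=bool)
    mask[1, 1] = True

    curr = np.zeros((4, 4, 3))
    curr[..., 0] = 1.0
    curr[2, 3] = [0.0, 2.0, 0.0]       # same object at patch (2, 3) in frame T

    sim = roi_similarity_map(prev, mask, curr)
    print(top_k_patches(sim, k=1))
```

In this toy setup the highest-scoring patch in frame $T$ is the one the object moved to, because cosine similarity ignores feature magnitude and compares only direction. How edgeVLM actually extracts features, pools ROIs, and converts the similarity map into attention guidance for the SVLM is not specified in this summary.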