Vision-Language Models for Edge Networks: A Comprehensive Survey

Ahmed Sharshar, Latif U. Khan, Waseem Ullah, Mohsen Guizani

TL;DR

This survey addresses the challenge of deploying Vision-Language Models on resource-constrained edge devices by reviewing lightweight architectures, compression techniques (pruning, quantization, knowledge distillation), and efficient fine-tuning methods (prompts, adapters). It covers edge deployment pipelines, data handling, model partitioning between edge and cloud, and privacy and security considerations, with examples spanning healthcare, environmental monitoring, autonomous systems, and surveillance. Key contributions include a taxonomy of edge-focused VLM design choices, deployment strategies, and a discussion of open challenges (security, privacy, cross-modality learning, and communication). The work highlights practical implications for real-time, on-device multimodal processing and outlines future directions, including federated and context-aware learning, hardware-aware architectures, and robust edge ecosystems.

Abstract

Vision Large Language Models (VLMs) combine visual understanding with natural language processing, enabling tasks like image captioning, visual question answering, and video analysis. While VLMs show impressive capabilities across domains such as autonomous vehicles, smart surveillance, and healthcare, their deployment on resource-constrained edge devices remains challenging because of limited processing power, memory, and energy. This survey explores recent advancements in optimizing VLMs for edge environments, focusing on model compression techniques (pruning, quantization, and knowledge distillation) and on specialized hardware solutions that enhance efficiency. We provide a detailed discussion of efficient training and fine-tuning methods, edge deployment challenges, and privacy considerations. Additionally, we discuss the diverse applications of lightweight VLMs across healthcare, environmental monitoring, and autonomous systems, illustrating their growing impact. By highlighting key design strategies and current challenges, and by offering recommendations for future directions, this survey aims to inspire further research into the practical deployment of VLMs, ultimately making advanced AI accessible in resource-limited settings.

Paper Structure

This paper contains 39 sections, 10 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Overview of integrating Vision-Language Models (VLMs) into edge networks. The diagram highlights IoT applications (e.g., healthcare, autonomous driving, smart homes, surveillance, gaming, education, fitness tracking, Industry 4.0, sports) that benefit from VLMs to address traditional Machine Learning (ML) challenges such as limited training data and semantic reasoning. It outlines key ML techniques (transfer learning, federated learning, semi-supervised learning) and model optimization methods (knowledge distillation, pruning, compression) for efficient deployment. Crucial enabling technologies, including edge computing, quantum communication and computing, and advanced wireless networks (5G/6G), are also presented to support VLM deployment at the edge.
  • Figure 3: The APoLLo framework provides a unified approach to multi-modal adapter and prompt learning for Vision-Language Pretraining (VLP) models. It incorporates both image (yellow) and text (red) adapters, which are connected via cross-modal attention mechanisms to enhance alignment between the two modalities. Each modality processes augmented inputs: text generated by large language models (LLMs) and images synthesized by text-conditioned diffusion models. This cross-modal interaction improves the coherence and performance of multi-modal tasks [chowdhury2023apollo].
  • Figure 4: The MobileVLM architecture. Inputs include visual data $X_v \in \mathbb{R}^{N_v \times D_v}$ and textual queries $X_q$, where $N_v$ is the number of visual tokens and $D_v$ is the visual feature dimension. These are processed by a vision encoder and tokenizer, respectively, producing hidden states $H_v \in \mathbb{R}^{(N_v/4) \times D_v}$ and $H_q$. The visual features are passed through a Lightweight Downsample Projector (LDP), which efficiently compresses the input using depthwise and pointwise convolutions. The resulting features $H_v$ and $H_q$ are then fed into MobileLLaMA, a compact vision-language model, which generates the final response $Y_a$ [mobilevlmv2].
  • Figure 5: The challenge of adapting large vision-language models to edge devices across different visual modalities. In this example, a resource-constrained cleaning robot equipped with RGB and depth cameras generates RGB-depth image pairs without scene labels. Using the pre-trained image encoder from CLIP as the teacher, the EdgeVL framework transfers knowledge to a smaller student encoder. This process requires no labels or human intervention, enabling the student model to directly process RGB or depth images for open-vocabulary scene classification on the device. In stage 1, EdgeVL distills the knowledge from the pre-trained visual encoder to the student model. In stage 2, it first fake-quantizes the pretrained student model, then uses contrastive learning to refine the student model [edgevl2024].
  • Figure 6: Design Process of Distributed Edge VLMs.
  • ...and 5 more figures
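
The Figure 5 caption mentions that EdgeVL "fake-quantizes" the student model before refining it. As a rough illustration of what fake quantization generally means, the sketch below rounds floating-point values onto a signed integer grid and immediately maps them back to floats, so that later computation sees the rounding error a low-bit model would incur. This is a generic symmetric per-tensor scheme written in plain Python; the function name `fake_quantize`, the 8-bit default, and the sample weights are assumptions for illustration and are not taken from EdgeVL or from the survey.

```python
def fake_quantize(values, num_bits=8):
    """Round floats onto a signed integer grid, then map them back to floats.

    Hypothetical helper: a generic symmetric per-tensor scheme,
    not the exact procedure used by EdgeVL.
    """
    qmax = 2 ** (num_bits - 1) - 1            # largest integer level, 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)                   # all zeros: nothing to quantize
    scale = max_abs / qmax                    # float width of one integer step
    result = []
    for v in values:
        q = round(v / scale)                  # nearest integer level
        q = max(-qmax, min(qmax, q))          # clamp to the representable range
        result.append(q * scale)              # dequantize back to float
    return result


# Illustrative values only (not real model weights).
weights = [0.5, -1.0, 0.25, 0.0]
approx = fake_quantize(weights, num_bits=8)
max_error = max(abs(w - a) for w, a in zip(weights, approx))
```

With this scheme the per-value error is bounded by half of one quantization step, which is why training against fake-quantized weights can prepare a model for true integer inference on an edge device.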