Self-Adapting Large Visual-Language Models to Edge Devices across Visual Modalities

Kaiwen Cai, Zhekai Duan, Gaowen Liu, Charles Fleming, Chris Xiaoxuan Lu

TL;DR

This work represents the first systematic effort to adapt large VL models for edge deployment, showcasing up to 15.4% accuracy improvements on multiple datasets and up to 93-fold reduction in model size.

Abstract

Recent advancements in Vision-Language (VL) models have sparked interest in their deployment on edge devices, yet challenges in handling diverse visual modalities, manual annotation, and computational constraints remain. We introduce EdgeVL, a novel framework that bridges this gap by seamlessly integrating dual-modality knowledge distillation and quantization-aware contrastive learning. This approach enables the adaptation of large VL models, like CLIP, for efficient use with both RGB and non-RGB images on resource-limited devices without the need for manual annotations. EdgeVL not only transfers visual language alignment capabilities to compact models but also maintains feature quality post-quantization, significantly enhancing open-vocabulary classification performance across various visual modalities. Our work represents the first systematic effort to adapt large VL models for edge deployment, showcasing up to 15.4% accuracy improvements on multiple datasets and up to 93-fold reduction in model size.
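
As a rough illustration of the dual-modality distillation idea, the sketch below pulls the student's features for both the RGB image and its paired non-RGB image toward the teacher's feature for the RGB image, so no labels are needed. The function names, the mean-absolute-distance loss, and the toy feature vectors are assumptions made for illustration; the abstract does not specify the exact loss.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def dual_modality_distillation_loss(teacher_rgb, student_rgb, student_non_rgb):
    """Hypothetical label-free loss: both student features chase the teacher's RGB feature."""
    # Term 1: student(RGB image) should match teacher(RGB image).
    rgb_term = l1_distance(student_rgb, teacher_rgb)
    # Term 2: student(paired non-RGB image) should match the same teacher feature.
    non_rgb_term = l1_distance(student_non_rgb, teacher_rgb)
    return rgb_term + non_rgb_term


# Toy features standing in for encoder outputs (assumed values, not real data).
teacher_rgb = [0.2, -0.4, 0.9]
student_rgb = [0.1, -0.3, 0.8]
student_depth = [0.0, -0.1, 0.5]
loss = dual_modality_distillation_loss(teacher_rgb, student_rgb, student_depth)
print(f"distillation loss: {loss:.4f}")
```

Because the only supervision is the teacher's feature for the paired RGB image, a loss of this shape needs image pairs but no scene labels, which matches the annotation-free setting the abstract describes.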

Paper Structure

This paper contains 36 sections, 10 equations, 9 figures, and 11 tables.

Figures (9)

  • Figure 1: The problem of adapting a large visual-language model to edge devices across visual modalities. We use a resource-constrained cleaning robot as the edge device for illustration. The robot has co-located RGB and depth cameras, which generate many paired images without scene labels. Using RGB-depth pairs as inputs and the pre-trained image encoder in CLIP as the teacher, EdgeVL is designed to transfer knowledge to a small student encoder without labels or human intervention. After this learning process, the student encoder can process either image modality for open-vocabulary scene classification on the device.
  • Figure 2: Overall architecture of our proposed method. In Stage 1, we distill knowledge from the pre-trained visual encoder into the student model. In Stage 2, we first fake-quantize the pre-trained student model, then refine it with contrastive learning.
  • Figure 3: Angles between the features of images and their corresponding text labels on the ScanNet dataset. We calculate the angles from the cosine similarities (a lower cosine similarity corresponds to a greater angle between features [han2022data]). A rightward shift in the angle distribution in b) and c) suggests that $\theta_2 > \theta_1$, indicating that image features diverge from the text labels following PTQ. Conversely, a leftward shift implies $\theta_3 < \theta_1$, showing that image features align more closely with the text labels after Stage 2. Dashed lines denote mean values. Best viewed in color.
  • Figure 4: Visualization of the predictions of different models on ScanNet and EuroSAT. CLIP-G, CQD [su2017adapting], and SKD [yang2022mixskd] fall short for non-RGB images, while EdgeVL (Swin-T) demonstrates superior performance across both image modalities.
  • Figure 5: Quantization-aware matrix multiplication is used to fake-quantize the student model.
  • ...and 4 more figures
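
The captions for Figures 2, 3, and 5 refer to fake quantization and to angles derived from cosine similarity. The sketch below shows one common form of each: a symmetric, per-tensor quantize-then-dequantize step, and the angle computed as the arccosine of the cosine similarity. The bit width, the toy feature vectors, and the function names are assumptions for illustration, not the paper's implementation.

```python
import math


def fake_quantize(values, num_bits=8):
    """Round values onto a signed integer grid, then map them back to floats."""
    qmax = 2 ** (num_bits - 1) - 1           # e.g. 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)                  # nothing to scale
    scale = max_abs / qmax                   # symmetric per-tensor step size
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def angle_degrees(a, b):
    """Angle between two vectors, derived from their cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos_sim = dot / (norm_a * norm_b)
    cos_sim = max(-1.0, min(1.0, cos_sim))   # guard against rounding drift
    return math.degrees(math.acos(cos_sim))


# Toy image and text features (assumed values, not taken from the paper).
image_feature = [0.62, -0.11, 0.35, 0.05]
text_feature = [0.60, -0.10, 0.40, 0.00]

theta_full = angle_degrees(image_feature, text_feature)
theta_quant = angle_degrees(fake_quantize(image_feature, num_bits=3), text_feature)
print(f"angle before quantization: {theta_full:.2f} degrees")
print(f"angle after 3-bit fake quantization: {theta_quant:.2f} degrees")
```

Comparing the two printed angles for a batch of image-text pairs gives the kind of distribution that Figure 3 plots, where a shift toward larger angles means the quantized image features have drifted away from their text labels.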