LQA: A Lightweight Quantized-Adaptive Framework for Vision-Language Models on the Edge

Xin Wang, Hualin Zhou, Sheng Guang Wang, Ting Dang, Yu Zhang, Hong Jia, Tao Gu

TL;DR

Vision-Language Models struggle to run robustly on edge devices due to distribution shifts and limited resources. The authors introduce LQA, a lightweight framework that jointly quantizes the backbone with modality-aware precision and performs gradient-free, cache-based test-time adaptation (Q-TTA) entirely in low precision. Core contributions include Selective Hybrid Quantization (SHQ) with Hessian-aware vision quantization and selective precision retention, plus a fully quantized Q-TTA mechanism that uses positive/negative exemplar caches. Across seven datasets with synthetic and real-world shifts, LQA achieves up to 19.9× memory savings and outperforms gradient-based TTA methods, enabling practical, privacy-preserving VLM deployment on edge devices.

Abstract

Deploying Vision-Language Models (VLMs) on edge devices is challenged by resource constraints and performance degradation under distribution shifts. While test-time adaptation (TTA) can counteract such shifts, existing methods are too resource-intensive for on-device deployment. To address this challenge, we propose LQA, a lightweight, quantized-adaptive framework for VLMs that combines a modality-aware quantization strategy with gradient-free test-time adaptation. We introduce Selective Hybrid Quantization (SHQ) and a quantized, gradient-free adaptation mechanism to enable robust and efficient VLM deployment on resource-constrained hardware. Experiments across both synthetic and real-world distribution shifts show that LQA improves overall adaptation performance by 4.5%, uses less memory than full-precision models, and significantly outperforms gradient-based TTA methods, achieving up to 19.9× lower memory usage across seven open-source datasets. These results demonstrate that LQA offers a practical pathway for robust, privacy-preserving, and efficient VLM deployment on edge devices.

Paper Structure (32 sections, 8 equations, 9 figures, 9 tables, 1 algorithm)

Figures (9)

  • Figure 1: Diagram of the proposed LQA. (a) LQA first takes a stream of images, along with pseudo class labels concatenated with hand-crafted prompts [radford2021learning], as input; (b) image and text inputs are encoded by quantized CLIP backbones to generate their features; (c) predictions are made from the cosine similarity of the embeddings; (d) image features and predictions are used to update the TDA cache; (e) the TDA cache is applied to the predictions to form the final predictions.
  • Figure 2: (a) Memory consumption and (b) latencies versus accuracy.
  • Figure 3: Comparison of runtime memory usage for different quantization methods.
  • Figure 4: Comparison of runtime latency across various quantization methods.
  • Figure 5: Average accuracy of different quantization methods at 8-bit precision.
  • ...and 4 more figures
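
Steps (c) to (e) of the Figure 1 caption can be illustrated with a small, gradient-free sketch. It is not the authors' implementation: it runs in ordinary floating point rather than low precision, keeps only a single positive cache (the negative cache mentioned in the TL;DR is omitted), and the names `ExemplarCache` and `adapt_and_predict` as well as the parameters `alpha`, `beta`, `scale`, and `capacity` are hypothetical choices made for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cosine(a, b):
    """Cosine similarity of two vectors that are already unit-normalized."""
    return sum(x * y for x, y in zip(a, b))


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


class ExemplarCache:
    """Hypothetical per-class cache that keeps the lowest-entropy features."""

    def __init__(self, num_classes, capacity=3):
        self.capacity = capacity
        self.slots = {c: [] for c in range(num_classes)}

    def update(self, label, feature, ent):
        slot = self.slots[label]
        slot.append((ent, feature))
        slot.sort(key=lambda item: item[0])  # most confident entries first
        del slot[self.capacity:]             # evict anything beyond capacity

    def logits(self, feature, num_classes, beta=5.0):
        # Each cached feature votes for its class, weighted by similarity.
        out = [0.0] * num_classes
        for label, slot in self.slots.items():
            for _, cached in slot:
                out[label] += math.exp(-beta * (1.0 - cosine(feature, cached)))
        return out


def adapt_and_predict(image_feat, text_feats, cache, alpha=2.0, scale=100.0):
    """One test-time step: predict, update the cache, refine the prediction."""
    feat = normalize(image_feat)
    # (c) zero-shot prediction from cosine similarity to the class text features
    clip_logits = [scale * cosine(feat, t) for t in text_feats]
    probs = softmax(clip_logits)
    pseudo_label = max(range(len(probs)), key=probs.__getitem__)
    # (d) update the cache with the feature and its pseudo-label
    cache.update(pseudo_label, feat, entropy(probs))
    # (e) combine the cache votes with the zero-shot logits; no gradients needed
    cache_logits = cache.logits(feat, len(text_feats))
    final = [z + alpha * c for z, c in zip(clip_logits, cache_logits)]
    return max(range(len(final)), key=final.__getitem__), final
```

In use, one would build the class text features once, create `ExemplarCache(num_classes)`, and call `adapt_and_predict` for each incoming image feature; the only state carried across the stream is the cache, which is why no backward pass or optimizer memory is required in this kind of scheme.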