VLMQ: Token Saliency-Driven Post-Training Quantization for Vision-language Models

Yufei Xue, Yushi Huang, Jiawei Shao, Lunjie Zhu, Chi Zhang, Xuelong Li, Jun Zhang

TL;DR

This work proposes VLMQ, a VLM-tailored PTQ framework that selectively prioritizes salient tokens while suppressing redundant ones during quantization, and introduces a gradient-driven importance factor to capture the token-wise importance variance.

Abstract

Post-training quantization (PTQ) has emerged as an effective technique for compressing large models and accelerating inference without retraining. While PTQ has been extensively studied in large language models (LLMs), its application to vision-language models (VLMs) remains underexplored. In this work, we identify two intrinsic characteristics of VLM activations: 1) visual over-representation, where vision tokens are excessive and often redundant, and 2) modality gap, which refers to the clear distribution gap between text and vision tokens in the latent feature space. Together, these two factors significantly degrade quantization performance but have been overlooked by existing PTQ methods. To address these challenges, we propose VLMQ, a VLM-tailored PTQ framework that selectively prioritizes salient tokens while suppressing redundant ones during quantization. In particular, we introduce a gradient-driven importance factor to capture the token-wise importance variance, the effectiveness of which is substantiated through both empirical and theoretical analysis. To ensure efficiency, we use lightweight block-wise backpropagation for factor acquisition. Finally, we reformulate the optimization objective into an importance-aware form to preserve important activation information. Extensive evaluations on 8 benchmarks across 0.5B to 32B VLMs demonstrate the state-of-the-art (SOTA) performance of VLMQ, particularly under low-bit settings. For example, it achieves a substantial 16.45% improvement on MME-RealWorld under 2-bit quantization.
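To make the "importance-aware form" of the objective more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the layer-wise reconstruction error is weighted per token by a non-negative importance factor, so that the usual activation second-moment matrix $X X^\top$ becomes $X\,\mathrm{diag}(g)\,X^\top$. The function names, the toy shapes, and the crude rounding quantizer are all hypothetical and chosen only for illustration.

```python
import numpy as np

def weighted_hessian(X, g):
    """Return X @ diag(g) @ X.T for activations X (d_in, T) and factors g (T,)."""
    # Broadcasting scales column t of X by g[t] without building diag(g).
    return (X * g) @ X.T

def weighted_recon_error(W, W_q, X, g):
    """Return sum_t g[t] * ||(W - W_q) @ x_t||^2 (token-weighted output error)."""
    E = (W - W_q) @ X                      # (d_out, T) per-token output error
    return float(np.sum(g * np.sum(E ** 2, axis=0)))

rng = np.random.default_rng(0)
d_in, d_out, T = 8, 4, 16                  # toy sizes (assumption)
X = rng.normal(size=(d_in, T))             # one column per token
W = rng.normal(size=(d_out, d_in))
W_q = np.round(W * 2) / 2                  # stand-in quantizer: round to a 0.5 grid
g = rng.uniform(0.1, 2.0, size=T)          # hypothetical importance factors

H = weighted_hessian(X, g)
dW = W - W_q
trace_form = float(np.trace(dW @ H @ dW.T))
direct = weighted_recon_error(W, W_q, X, g)
```

Under these assumptions, `trace_form` and `direct` agree, and setting every factor to 1 recovers the unweighted $X X^\top$ (up to a constant factor) that Hessian-based PTQ methods such as GPTQ build on.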

Paper Structure

This paper contains 59 sections, 1 theorem, 18 equations, 6 figures, 16 tables, 1 algorithm.

Key Result

Theorem 1

The target loss perturbation $\Delta {\mathcal{L}}$ can be approximated by its first-order error term, where $\theta \in {\mathbb{R}}^{D}$ and ${\mathbf{z}} \in {\mathbb{R}}^{Q}$ are the stacked weights being quantized and the layer output, respectively.
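For orientation, a generic first-order Taylor expansion of a loss in the layer output has the form below. This is a hedged sketch that assumes the loss depends on $\theta$ only through ${\mathbf{z}}$, with $\Delta {\mathbf{z}}$ denoting the output error induced by quantizing $\theta$. It is not necessarily the paper's exact statement or notation.

$$
\Delta {\mathcal{L}} = {\mathcal{L}}({\mathbf{z}} + \Delta {\mathbf{z}}) - {\mathcal{L}}({\mathbf{z}}) \approx \left( \frac{\partial {\mathcal{L}}}{\partial {\mathbf{z}}} \right)^{\top} \Delta {\mathbf{z}}
$$

Read this way, a token whose output gradient is large contributes more to $\Delta {\mathcal{L}}$ for the same output error, which is consistent with using gradients to weight tokens.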

Figures (6)

  • Figure 1: Baseline vs. VLMQ. The diagonal matrix in VLMQ represents the importance factors. The reported accuracy is from 2-bit Qwen2-VL-7B-Instruct [wang2024qwen2vl] quantized by GPTQ and VLMQ.
  • Figure 2: PCA-based [shlens2014pca] activation feature analysis with activations (4096 points) extracted from the pre-attention breakpoint of the 20th transformer layer in Qwen2-VL-7B-Instruct. The left two subfigures depict the activation feature distributions constructed from text-only and mixed text-vision activations, respectively. The right three subfigures visualize the distributions with varying token-wise importance factors. Light red/green and dark red/green points denote tokens classified as important and unimportant, respectively. The reported average accuracy is under INT3 quantization across eight vision-language benchmarks.
  • Figure 3: Visualization of normalized token-wise error ($\Delta {\mathbf{z}}$) and gradient (${\mathbf{p}}^{(\Delta{\mathbf{z}})}$). The red circle indicates the salient vision tokens. The magnitude of the error remains relatively stable across tokens, whereas the gradient varies across tokens in different modalities.
  • Figure 4: Derivation and visualization of importance factors.
  • Figure 5: Pipeline of computing importance factors. The "Forward" module illustrates the quantization dataflow across decoding layers, where a breakpoint is set at the output of each attention module to compute the local loss ${\mathcal{L}}_\text{Block}$ and trigger a localized backward pass. The "Backward" module details the internal operations within an attention block, where gradients of each linear projection output are cached to derive token-level importance factors.
  • ...and 1 more figure
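The Figure 5 caption says that gradients of each linear projection output are cached to derive token-level importance factors, but this summary does not state the exact reduction. The sketch below shows one plausible choice, assumed here purely for illustration: take the L2 norm of each token's cached gradient and normalize the result to mean 1. The function name `token_importance`, the `eps` guard, and the toy gradient values are hypothetical.

```python
import numpy as np

def token_importance(grad_z, eps=1e-12):
    """Map cached output gradients (T, d_out) to per-token factors (T,).

    Assumed reduction: per-token L2 norm, normalized to mean 1.
    The method described in the paper may use a different rule.
    """
    norms = np.linalg.norm(grad_z, axis=1)   # one scalar per token
    return norms / (norms.mean() + eps)      # eps avoids division by zero

# Toy cached gradients for three tokens (made-up values).
grad_z = np.array([[0.0, 0.0],
                   [3.0, 0.0],
                   [0.0, 6.0]])
factors = token_importance(grad_z)
```

Factors of this kind could then fill the diagonal matrix mentioned in the Figure 1 caption, so that tokens with larger gradients weigh more in the layer-wise objective.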

Theorems & Definitions (1)

  • Theorem 1