MBQ: Modality-Balanced Quantization for Large Vision-Language Models

Shiyao Li, Yingchun Hu, Xuefei Ning, Xihui Liu, Ke Hong, Xiaotao Jia, Xiuhong Li, Yaqi Yan, Pei Ran, Guohao Dai, Shengen Yan, Huazhong Yang, Yu Wang

TL;DR

MBQ targets the memory and compute overhead of large vision-language models (VLMs). It starts from the observation that vision and language tokens differ significantly in their sensitivity to quantization error. Modality-Balanced Quantization (MBQ) uses loss-gradient-based sensitivity signals to balance the reconstruction loss across vision and language tokens during calibration, which improves accuracy for both weight-only and weight-activation quantization. MBQ integrates with channel-wise equalization and rotation-based quantization, improving task accuracy by up to 4.4% under W3 and 11.6% under W4A8 quantization on 7B–72B VLMs compared to SOTA baselines. A W3 GPU kernel that fuses dequantization with GEMV gives a 1.4x speedup on LLaVA-onevision-7B on the RTX 4090. The gains hold across diverse models and datasets, making large VLMs cheaper to deploy while preserving their language and vision capabilities.

Abstract

Vision-Language Models (VLMs) have enabled a variety of real-world applications. The large parameter size of VLMs brings large memory and computation overhead which poses significant challenges for deployment. Post-Training Quantization (PTQ) is an effective technique to reduce the memory and computation overhead. Existing PTQ methods mainly focus on large language models (LLMs), without considering the differences across other modalities. In this paper, we discover that there is a significant difference in sensitivity between language and vision tokens in large VLMs. Therefore, treating tokens from different modalities equally, as in existing PTQ methods, may over-emphasize the insensitive modalities, leading to significant accuracy loss. To deal with the above issue, we propose a simple yet effective method, Modality-Balanced Quantization (MBQ), for large VLMs. Specifically, MBQ incorporates the different sensitivities across modalities during the calibration process to minimize the reconstruction loss for better quantization parameters. Extensive experiments show that MBQ can significantly improve task accuracy by up to 4.4% and 11.6% under W3 and W4A8 quantization for 7B to 70B VLMs, compared to SOTA baselines. Additionally, we implement a W3 GPU kernel that fuses the dequantization and GEMV operators, achieving a 1.4x speedup on LLaVA-onevision-7B on the RTX 4090. The code is available at https://github.com/thu-nics/MBQ.
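
The calibration idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the per-vector symmetric quantizer, and the choice of mean absolute gradient as the sensitivity signal are all assumptions made for illustration. The sketch only shows how a per-modality weight changes the reconstruction loss that calibration would minimize.

```python
# Illustrative sketch only; every name here is hypothetical, not from the MBQ codebase.

def quantize(weights, n_bits):
    """Symmetric round-to-nearest quantization with one scale per weight vector."""
    q_max = 2 ** (n_bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / q_max if max_abs > 0 else 1.0
    return [max(-q_max - 1, min(q_max, round(w / scale))) * scale for w in weights]


def modality_sensitivity(token_grads, is_vision):
    """Return (vision, language) mean absolute gradient over each modality's tokens.

    Assumption: the mean absolute loss gradient stands in for "sensitivity".
    """
    sums = {True: 0.0, False: 0.0}
    counts = {True: 0, False: 0}
    for grad, vision in zip(token_grads, is_vision):
        sums[vision] += sum(abs(g) for g in grad)
        counts[vision] += len(grad)
    return (sums[True] / max(counts[True], 1), sums[False] / max(counts[False], 1))


def balanced_loss(tokens, is_vision, weights, q_weights, s_vision, s_language):
    """Squared output error per token, weighted by that token's modality sensitivity."""
    total = 0.0
    for x, vision in zip(tokens, is_vision):
        # Output error of a single linear unit: x . (w - w_quantized).
        err = sum(xi * (w - qw) for xi, w, qw in zip(x, weights, q_weights))
        total += (s_vision if vision else s_language) * err * err
    return total


# Toy calibration data: two vision tokens and one language token (made-up values).
tokens = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
is_vision = [True, True, False]
token_grads = [[0.01, -0.03], [0.02, 0.02], [0.5, -1.5]]
weights = [0.7, -0.2]

q_weights = quantize(weights, n_bits=3)
s_vision, s_language = modality_sensitivity(token_grads, is_vision)
loss = balanced_loss(tokens, is_vision, weights, q_weights, s_vision, s_language)
print(s_vision, s_language, loss)
```

Setting both sensitivities to 1 recovers the usual modality-agnostic reconstruction loss, which is the baseline behavior the abstract argues can over-emphasize the less sensitive modality.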

Paper Structure (33 sections, 10 equations, 2 figures, 11 tables)

Figures (2)

  • Figure 1: The gradients of the loss function w.r.t. the token features of the 13th transformer block in the LLaVA-onevision-7B VLM on the COCO caption dataset (coco-cap, sharegpt4v). Red and orange represent vision tokens; blue and green represent language tokens.
  • Figure 2: The inference process of large VLMs. Blue patches represent language tokens; red patches represent vision tokens.