Q-VLM: Post-training Quantization for Large Vision-Language Models

Changyuan Wang, Ziwei Wang, Xiuwei Xu, Yansong Tang, Jie Zhou, Jiwen Lu

TL;DR

Large vision-language models are powerful but resource-intensive, hindering deployment on constrained devices. Q-VLM introduces post-training quantization that mines cross-layer dependency using an activation-entropy proxy to partition the model into blocks and perform block-wise rounding search, complemented by visual-encoder optimization to shrink the search space. The method formalizes a block-wise objective and a composite loss combining quantization error, entropy-guided layer weighting, and auto-regressive guidance, achieving substantial efficiency gains with negligible accuracy loss on LVLM benchmarks. Empirically, it delivers memory compression of about $2.78\times$ and a generation speed-up of about $1.44\times$ on roughly $13$B LLaVA, while surpassing state-of-the-art PTQ approaches across multiple LVLMs and datasets, including 4-bit settings. This work enables practical, scalable deployment of LVLMs on resource-constrained platforms, though ultra-low bitwidths remain challenging and future work will target embedded-device optimization.

Abstract

In this paper, we propose a post-training quantization framework for large vision-language models (LVLMs) that enables efficient multi-modal inference. Conventional quantization methods sequentially search the layer-wise rounding functions by minimizing activation discretization errors, which fails to find the optimal quantization strategy because cross-layer dependency is ignored. In contrast, we mine the cross-layer dependency that significantly influences the discretization errors of the entire vision-language model, and embed this dependency into the search for the optimal quantization strategy at low search cost. Specifically, we observe a strong correlation between the activation entropy and the cross-layer dependency with respect to output discretization errors. We therefore employ the entropy as a proxy to partition blocks optimally, aiming for a satisfactory trade-off between discretization errors and search cost. Moreover, we optimize the visual encoder to disentangle the cross-layer dependency for fine-grained decomposition of the search space, so that the search cost is further reduced without harming quantization accuracy. Experimental results demonstrate that our method compresses memory by 2.78x and increases generation speed by 1.44x on the 13B LLaVA model without performance degradation on diverse multi-modal reasoning tasks. Code is available at https://github.com/ChangyuanWang17/QVLM.
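The entropy-as-proxy idea can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that); it is one hypothetical reading of the abstract. The histogram-based entropy estimate, the function names `activation_entropy` and `partition_blocks`, the `threshold` and `max_depth` parameters, and the rule that a high-entropy layer keeps the current block open are all assumptions made for illustration.

```python
import numpy as np


def activation_entropy(acts, num_bins=256):
    """Estimate the Shannon entropy (in bits) of activation values.

    Assumption: entropy is taken over a fixed-width histogram of the
    activations; the paper may estimate it differently.
    """
    hist, _ = np.histogram(np.asarray(acts).ravel(), bins=num_bins)
    probs = hist[hist > 0] / hist.sum()
    return float(-(probs * np.log2(probs)).sum())


def partition_blocks(entropies, threshold, max_depth):
    """Greedily group consecutive layers into blocks.

    Hypothetical rule: a layer whose entropy is at least `threshold` is
    treated as strongly coupled to the next layer, so the block stays open.
    A block is closed when a low-entropy layer is reached or when it holds
    `max_depth` layers.
    """
    blocks, current = [], []
    for layer_idx, entropy in enumerate(entropies):
        current.append(layer_idx)
        if entropy < threshold or len(current) == max_depth:
            blocks.append(current)
            current = []
    if current:  # trailing layers that never hit a closing condition
        blocks.append(current)
    return blocks


# Toy per-layer entropies, chosen by hand purely to exercise the code.
toy_entropies = [3.0, 1.0, 3.0, 3.0, 3.0, 1.0]
print(partition_blocks(toy_entropies, threshold=2.0, max_depth=3))
```

Under this reading, a rounding search would then run jointly over the layers of each block instead of layer by layer. Capping the block size with `max_depth` reflects the trade-off the abstract describes: larger blocks capture more cross-layer dependency but enlarge the search space.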

Paper Structure

This paper contains 16 sections, 10 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: The overall pipeline of our method. We employ entropy as the proxy to represent cross-layer dependency for efficient block assignment, which decomposes the large search space from the entire model to blocks containing multiple layers. Moreover, the visual encoder is further optimized for fine-grained search space decomposition.
  • Figure 2: The correlation between discretization error difference (DED) and the activation entropy in the 15th layer.
  • Figure 3: (a) The answering accuracy and search cost w.r.t. different maximum layer depths within a block. (b) The answering accuracy and search cost w.r.t. different hyperparameters across various vision-language models. (c) Quantization errors w.r.t. different maximum layer depths across various layers.
  • Figure 4: Visual reasoning examples from LLaVA-13B model. Q-VLM improves over the AWQ baseline for W4A4 quantization, reducing quantization errors and providing more reasonable answers. We color the text to show the correct or wrong responses.
  • Figure 5: (a) The correlation between discretization error difference (DED) and the quantization errors in the 15th layer. (b) The correlation between DED and the entropy in the 5th layer and (c) in the 25th layer.