LUQ: Layerwise Ultra-Low Bit Quantization for Multimodal Large Language Models

Shubhang Bhatnagar, Andy Xu, Kar-Han Tan, Narendra Ahuja

TL;DR

This paper proposes a novel strategy for MLLM quantization, LUQ: Layerwise Ultra-Low Bit Quantization, which selectively applies ultra-low bit quantization to layers that are more resilient to it and shows that using a mix of multimodal tokens for PTQ boosts VQA performance in the ultra-low bit regime.

Abstract

Large Language Models (LLMs) with multimodal capabilities have revolutionized vision-language tasks, but their deployment often requires huge memory and computational resources. While post-training quantization (PTQ) has successfully compressed language models to as low as 1-bit precision without significant performance loss, its effectiveness for multimodal LLMs (MLLMs) remains relatively unexplored. In this paper, we present the first study on ultra-low bit (<4-bit) quantization for multimodal LLMs. Our analysis reveals that multimodal tokens, and the intermediate layer activations they produce, exhibit significantly higher statistical variance and entropy than text tokens, making them less tolerant to ultra-low bit quantization. However, the activation distributions of multimodal tokens vary significantly across layers, with some layers having lower-entropy activation distributions. We empirically show that such layers can better tolerate ultra-low bit quantization. Building on these insights, we propose a novel strategy for MLLM quantization, LUQ: Layerwise Ultra-Low Bit Quantization, which selectively applies ultra-low bit quantization to layers that are more resilient to it. We also show that using a mix of multimodal tokens (image and text) for PTQ boosts VQA performance in the ultra-low bit regime. We evaluate our method on LLaVA-1.5 and Qwen-2.5-VL across 9 popular VQA benchmarks. The resulting LUQ models use 40% and 31% less memory than their 4-bit counterparts, respectively, while exhibiting a performance degradation of less than 10% on the MME benchmark.
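
The abstract's central observation is that activation entropy differs from layer to layer. A minimal sketch of how such a per-layer comparison could be made is given below. It assumes a simple histogram estimate of Shannon entropy over flattened activation values; the estimator, the bin count, and the function names `activation_entropy` and `rank_layers_by_entropy` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def activation_entropy(acts, num_bins=256):
    """Histogram-based Shannon entropy (in bits) of flattened activations.

    Assumption: a fixed-width histogram over the layer's own value range.
    The paper's exact entropy estimator is not stated in this summary.
    """
    counts, _ = np.histogram(np.asarray(acts).ravel(), bins=num_bins)
    probs = counts[counts > 0] / counts.sum()  # drop empty bins, normalize
    return float(-(probs * np.log2(probs)).sum())


def rank_layers_by_entropy(layer_acts, num_bins=256):
    """Return (layer names sorted from lowest to highest entropy, entropies)."""
    entropies = {
        name: activation_entropy(acts, num_bins)
        for name, acts in layer_acts.items()
    }
    return sorted(entropies, key=entropies.get), entropies
```

Here `layer_acts` would map a layer name to the activations collected for that layer on calibration tokens. Under this sketch, layers at the front of the returned list are the first candidates for ultra-low bit quantization.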

Paper Structure

This paper contains 21 sections, 9 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Performance vs. Compression Trade-off for Qwen 2.5 VL. Our method, Layerwise Ultra-Low Bit Quantization (LUQ), achieves a better trade-off on the MME benchmark compared to AWQ and GPTQ baselines when used to quantize the multimodal LLM in the ultra-low bit regime.
  • Figure 2: Entropy of intermediate activation distributions of Multimodal vs Text only tokens in Qwen 2.5 VL. Activations produced by multimodal tokens have significantly higher entropy than purely text tokens, potentially explaining poorer resilience of multimodal LLMs to quantization.
  • Figure 3: An overview of LUQ: Layerwise Ultra-Low Bit Quantization. (i) Generation of multimodal calibration tokens by passing multimodal data through a CLIP model augmented with a connector to align the modalities; (ii) extraction of layerwise activations from the multimodal large language model (LLM); (iii) entropy-based layer selection, where the entropy of activations is calculated to identify the layer most suitable for quantization, prioritizing the layers with the lowest entropy; (iv) iterative quantization of layers, where candidate layers are quantized to ultra-low bit precision using existing post-training quantization (PTQ) algorithms. Quantization of each layer is followed by a checking step, in which the performance or memory of the candidate LUQ model, formed by combining all currently ultra-low bit quantized layers with the higher bit layers, is compared with a pre-defined memory or performance threshold. The iterations continue if the memory threshold is not met or if the model performs better than the performance threshold; (v) once the iterative quantization process concludes, the layers quantized to different bit widths are combined back for inference.
  • Figure 3: Impact of calibration data composition (multimodal mix vs. text-only tokens) on post-training quantization (PTQ) of LLaVA-1.5. PTQ methods that quantize the model to fewer than 4 bits (LUQ and BiLLM) show larger VQA performance improvements with mixed multimodal token calibration, whereas 4-bit methods show only marginal improvements.
  • Figure 4: Performance versus average bit-width for various post-training quantization methods using LLaVA 1.5 7B. (a) On the MME benchmark, LUQ significantly outperforms other methods for the LLaVA 1.5 model. (b) On the VQA v2 benchmark, LUQ maintains high accuracy for LLaVA 1.5 even at aggressive compression rates, whereas baseline methods show a sharp decline in performance.
  • ...and 5 more figures
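
Steps (iii) and (iv) of the overview caption describe a greedy loop: pick the lowest-entropy layer, quantize it to ultra-low precision, then check a memory or performance threshold. The sketch below illustrates that control flow only. It assumes per-layer entropies and parameter counts are already available, uses average bit-width as a proxy for memory (ignoring overhead such as scales and zero points), and treats the performance check as an optional callback. The function name, the `evaluate` callback, the default bit-widths, and the revert-and-stop behavior on a failed performance check are all illustrative assumptions rather than the authors' implementation.

```python
from typing import Callable, Dict, List, Optional, Tuple


def greedy_layer_selection(
    entropies: Dict[str, float],
    num_params: Dict[str, int],
    high_bits: float = 4.0,        # assumed bit-width of the remaining layers
    low_bits: float = 1.0,         # placeholder ultra-low bit-width
    target_avg_bits: float = 3.0,  # memory budget expressed as average bits
    evaluate: Optional[Callable[[Dict[str, float]], float]] = None,
    min_score: float = 0.0,
) -> Tuple[List[str], float]:
    """Mark lowest-entropy layers for ultra-low bit quantization.

    Stops once the parameter-weighted average bit-width reaches the target,
    or when the optional `evaluate` callback scores below `min_score`.
    """
    total = sum(num_params.values())
    bits = {name: high_bits for name in entropies}

    def avg_bits() -> float:
        return sum(bits[n] * num_params[n] for n in bits) / total

    chosen: List[str] = []
    for name in sorted(entropies, key=entropies.get):  # lowest entropy first
        if avg_bits() <= target_avg_bits:
            break  # memory target met
        bits[name] = low_bits
        if evaluate is not None and evaluate(bits) < min_score:
            bits[name] = high_bits  # revert this layer and stop
            break
        chosen.append(name)
    return chosen, avg_bits()
```

In a real pipeline, the line that sets `bits[name] = low_bits` would be where an existing PTQ algorithm quantizes that layer, and `evaluate` would run the candidate mixed-precision model on a validation benchmark.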