CASP: Compression of Large Multimodal Models Based on Attention Sparsity

Mohsen Gholami, Mohammad Akbari, Kevin Cannons, Yong Zhang

TL;DR

CASP is a finetuning-free, data-aware compression method for large multimodal models that exploits attention sparsity: it first applies a data-driven low-rank decomposition to the Query and Key weights in a whitened space, then performs per-layer bit allocation for quantization under a fixed budget. The authors prove a bound showing that the compression error on $W_q$ and $W_k$ is controlled by the sparsity of the attention map, and derive an optimal bit-allocation formula that maximizes reconstruction quality under a target average bit rate. Empirically, CASP consistently improves state-of-the-art 2-bit quantization baselines across image-language, video-language, and language-only benchmarks, while remaining compatible with existing PTQ methods and applicable to LLMs as well. The result is a highly compressed multimodal model that is more practical to deploy.

Abstract

In this work, we propose an extreme compression technique for Large Multimodal Models (LMMs). While previous studies have explored quantization as an efficient post-training compression method for Large Language Models (LLMs), low-bit compression for multimodal models remains under-explored. The redundant nature of inputs in multimodal models results in a highly sparse attention matrix. We theoretically and experimentally demonstrate that the attention matrix's sparsity bounds the compression error of the Query and Key weight matrices. Based on this, we introduce CASP, a model compression technique for LMMs. Our approach performs a data-aware low-rank decomposition on the Query and Key weight matrices, followed by quantization across all layers based on an optimal bit allocation process. CASP is compatible with any quantization technique and enhances state-of-the-art 2-bit quantization methods (AQLM and QuIP#) by an average of 21% on image- and video-language benchmarks.
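
To make the "data-aware low-rank decomposition" step more concrete, here is a minimal NumPy sketch of one common way to do it: whiten the weight with a Cholesky factor of the calibration Gram matrix, truncate the SVD in that space, and map the factors back. This is an illustration under stated assumptions, not the authors' implementation. The function names `whitened_low_rank` and `plain_low_rank`, the `y = x @ W` convention, the calibration matrix `X`, and the `eps` regularizer are all hypothetical.

```python
import numpy as np


def whitened_low_rank(W, X, rank, eps=1e-6):
    """Return factors (A, B) with A @ B ~= W, fitted to keep X @ A @ B close to X @ W.

    W: (d_in, d_out) weight, applied as y = x @ W (assumed convention).
    X: (n_samples, d_in) calibration activations (hypothetical input).
    """
    d_in = W.shape[0]
    # Gram matrix of the calibration inputs; eps keeps it positive definite.
    G = X.T @ X + eps * np.eye(d_in)
    L = np.linalg.cholesky(G)  # G = L @ L.T, with L lower triangular
    # In the whitened space, ||X @ (W - W')||_F matches ||L.T @ (W - W')||_F
    # (up to the eps term), so a truncated SVD of L.T @ W is optimal there.
    U, s, Vt = np.linalg.svd(L.T @ W, full_matrices=False)
    A = np.linalg.solve(L.T, U[:, :rank] * s[:rank])  # map back: (d_in, rank)
    B = Vt[:rank]  # (rank, d_out)
    return A, B


def plain_low_rank(W, rank):
    """Data-agnostic baseline: truncated SVD of W itself."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank]


# Toy calibration data with very uneven per-feature scales.
rng = np.random.default_rng(0)
X = rng.standard_normal((256, 16)) * np.linspace(0.1, 3.0, 16)
W = rng.standard_normal((16, 16))

A, B = whitened_low_rank(W, X, rank=4)
P, Q = plain_low_rank(W, rank=4)
err_aware = np.linalg.norm(X @ W - X @ A @ B)
err_plain = np.linalg.norm(X @ W - X @ P @ Q)
print(f"data-aware error: {err_aware:.3f}, plain SVD error: {err_plain:.3f}")
```

The comparison against a plain truncated SVD shows why the whitening matters: both factorizations have the same rank, but only the data-aware one is fitted to the outputs `X @ W` rather than to `W` alone.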

Paper Structure

This paper contains 23 sections, 19 equations, 10 figures, 13 tables.

Figures (10)

  • Figure 1: CASP considers the specific properties of LMMs and offers significant improvement over state-of-the-art model quantization methods. PPL: perplexity.
  • Figure 2: Left column: Comparison of LLaVa-Next-Video-7B and Llama2-7B attention maps ($S$ in Eq. \ref{eq:att_mat}) at Layer 15. Despite LLaVa-Next-Video using Llama2 as its base LLM, there is a notable difference in their maps, with LLaVa showing high sparsity. Middle column: Attention maps when $W_q$ and $W_k$ (i.e., attention weights) are 94% compressed (equivalent to 1 bit). Right column: Compression errors ($E$ in Eq. \ref{eq:error}). The sparsity in LLaVa's map results in smaller errors when compressing $W_q$ and $W_k$.
  • Figure 3: Compression error $E$ (Eq. \ref{eq:error}) for LLaVa-Next-Video and LLaVa1.5 decreases as the percentage of visual tokens increases (i.e., as the attention map becomes sparser).
  • Figure 4: Optimal bit computation by Eq. \ref{eq:quant2} for different $\mu$ values. Note that the layer bit (i.e., $b_l$ in Eq. \ref{eq:quant2}) only accounts for the compression obtained from quantization, not the low-rank decomposition. Therefore, the average layer bit in this plot is not the actual average bit of the model (i.e., $B_{\text{avg}}$ in Eq. \ref{eq:quant2}).
  • Figure 5: Qualitative results from the LiveBench dataset. The GPT-4o scores out of 10 are shown for each method.
  • ...and 5 more figures