Towards Efficient VLMs: Information-Theoretic Driven Compression via Adaptive Structural Pruning

Zhaoqi Xu, Yingying Zhang, Jian Li, Jianwei Guo, Qiannan Zhu, Hua Huang

TL;DR

InfoPrune is an information-theoretic framework for compressing vision-language models that casts pruning as an Information Bottleneck trade-off. It jointly reduces attention-head redundancy through an eRank-based objective and preserves task-relevant information through KS-based spectral alignment. Two complementary schemes are offered: training-based head pruning and training-free FFN compression via adaptive low-rank SVD. On multimodal benchmarks the approach reports up to 3.2x FLOP reduction and 1.8x speedup with negligible accuracy loss, and it comes with theoretical guarantees on information preservation during compression. The result is a unified, theoretically grounded way to reduce the computation of large VLMs without sacrificing semantic fidelity in multimodal reasoning.

Abstract

Recent advances in vision-language models (VLMs) have shown remarkable performance across multimodal tasks, yet their ever-growing scale poses severe challenges for deployment and efficiency. Existing compression methods often rely on heuristic importance metrics or empirical pruning rules, lacking theoretical guarantees about information preservation. In this work, we propose InfoPrune, an information-theoretic framework for adaptive structural compression of VLMs. Grounded in the Information Bottleneck principle, we formulate pruning as a trade-off between retaining task-relevant semantics and discarding redundant dependencies. To quantify the contribution of each attention head, we introduce an entropy-based effective rank (eRank) and employ the Kolmogorov--Smirnov (KS) distance to measure the divergence between original and compressed structures. This yields a unified criterion that jointly considers structural sparsity and informational efficiency. Building on this foundation, we further design two complementary schemes: (1) a training-based head pruning guided by the proposed information loss objective, and (2) a training-free FFN compression via adaptive low-rank approximation. Extensive experiments on VQAv2, TextVQA, and GQA demonstrate that InfoPrune achieves up to 3.2x FLOP reduction and 1.8x acceleration with negligible performance degradation, establishing a theoretically grounded and practically effective step toward efficient multimodal large models.
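
The entropy-based effective rank mentioned in the abstract can be illustrated with a short sketch. It assumes the commonly used definition of effective rank, namely the exponential of the Shannon entropy of the normalized singular values; the paper's exact formulation may differ. The function name `erank` and the `eps` threshold are illustrative choices, not taken from the paper.

```python
import numpy as np

def erank(matrix: np.ndarray, eps: float = 1e-12) -> float:
    """Effective rank under the assumed definition:
    exp(Shannon entropy of the normalized singular values)."""
    s = np.linalg.svd(matrix, compute_uv=False)
    p = s / (s.sum() + eps)           # normalize the spectrum to a distribution
    p = p[p > eps]                    # drop numerically zero entries before log
    entropy = -np.sum(p * np.log(p))
    return float(np.exp(entropy))

# A flat spectrum uses every direction equally; a rank-one matrix uses one.
full = erank(np.eye(4))                           # approximately 4.0
low = erank(np.outer(np.ones(3), np.ones(5)))     # approximately 1.0
```

Under this definition a head whose output concentrates its energy in a few directions scores a low eRank, which is the sense in which the quantity can serve as a redundancy measure.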

Paper Structure

This paper contains 29 sections, 3 theorems, 32 equations, 6 figures, 4 tables, 2 algorithms.

Key Result

Theorem 1

Our objective is to minimize the redundancy in the pruned model. It can be proved that the difference in mutual information $I(X; Z_S) - I(X; Z)$ is equivalent to the difference in eRank between the pruned and original intermediate outputs, i.e.

Figures (6)

  • Figure 1: Normalized importance scores of attention heads across 32 layers in the Qwen2VL-7B visual modules. The scores are derived via min--max normalization of negative entropy values, reflecting each head’s relative contribution.
  • Figure 2: Row density ratio of the visual feed-forward network in Qwen2VL-7B. $W_1$ and $W_2$ denote two projection matrices, and $W_1 \times W_2$ represents their composition, as related to our method. The sparse ratio is computed via the $L_1$ norm; a row is considered sparse if its $L_1$ norm is below 5% of the input dimension.
  • Figure 3: Overview of our method. For head pruning, eRank is used to minimize the mutual information $I(X; Z_S)$ between input $X$ and pruned representation $Z_S$, while the KS distance maximizes $I(Y; Z_S)$ between the output $Y$ and $Z_S$. For FFN pruning, we employ SVD and the Eckart--Young theorem to automatically determine the retained rank under a predefined compression target.
  • Figure 4: Illustration of head pruning. The method combines two strategies: discarding redundant information and preserving meaningful information. On the left, the goal is to minimize $I(X; Z_S)$ to reduce redundancy, which is realized by minimizing the eRank of the pruned model. On the right, the goal is to maximize $I(Y; Z_S)$ to retain meaningful information, which uses SVD to select the $k$ largest singular values and minimizes the KS distance.
  • Figure 5: Visualization of head pruning results across different datasets.
  • ...and 1 more figure
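
The Figure 4 caption describes keeping the $k$ largest singular values and minimizing a KS distance. One way such a distance could be computed is sketched below, assuming a two-sample Kolmogorov-Smirnov statistic between the empirical distributions of the top-$k$ singular values of an original and a pruned output matrix. The helper names (`ks_distance`, `top_k_singular_values`, `spectral_ks`) and the choice to compare raw singular values are assumptions for illustration, not the paper's stated procedure.

```python
import numpy as np

def ks_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])     # the gap can only change at sample points
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

def top_k_singular_values(matrix: np.ndarray, k: int) -> np.ndarray:
    """The k largest singular values (np.linalg.svd returns them descending)."""
    return np.linalg.svd(matrix, compute_uv=False)[:k]

def spectral_ks(original: np.ndarray, pruned: np.ndarray, k: int) -> float:
    """KS distance between the top-k spectra of two matrices (hypothetical)."""
    return ks_distance(top_k_singular_values(original, k),
                       top_k_singular_values(pruned, k))

# Toy usage: compare a random "head output" with a column-pruned copy.
rng = np.random.default_rng(0)
z = rng.normal(size=(16, 8))
distance = spectral_ks(z, z[:, :4], k=4)
```

The statistic lies in [0, 1] and is 0 when the two spectra coincide, so a smaller value indicates that the pruned structure keeps a spectrum closer to the original.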

Theorems & Definitions (3)

  • Theorem 1: Head Pruning and eRank Optimization
  • Theorem 2: Head Pruning and KS Distance
  • Theorem 3: Adaptive FFN Pruning
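
The FFN scheme named in Theorem 3 and in the Figure 3 caption relies on SVD and the Eckart-Young theorem to pick a retained rank under a compression target. A minimal single-matrix sketch follows. It assumes the target is expressed as a fraction of the original parameter count (`keep_ratio`) and that a rank-$r$ factorization of an $m \times n$ matrix stores $r(m+n)$ parameters; how the paper distributes the budget across layers is not covered here. The name `low_rank_factors` is hypothetical.

```python
import numpy as np

def low_rank_factors(weight: np.ndarray, keep_ratio: float):
    """Truncated-SVD factorization sized to a parameter budget (a sketch).

    Returns factors a (m x r) and b (r x n) with weight ~= a @ b, plus the
    relative Frobenius error, which by Eckart-Young is the smallest possible
    for any rank-r approximation.
    """
    m, n = weight.shape
    # Largest rank whose factor storage r * (m + n) fits within the budget.
    rank = int(keep_ratio * m * n / (m + n))
    rank = max(1, min(rank, min(m, n)))
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]        # fold singular values into the left factor
    b = vt[:rank]
    # Error depends only on the discarded singular values.
    rel_err = float(np.sqrt(np.sum(s[rank:] ** 2) / np.sum(s ** 2)))
    return a, b, rel_err

# Toy usage on a random 64 x 48 matrix with half the parameters kept.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 48))
a, b, err = low_rank_factors(w, keep_ratio=0.5)
```

Because the error is determined by the discarded singular values alone, no retraining is needed to evaluate a candidate rank, which is consistent with the training-free character of the FFN scheme.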