QAPruner: Quantization-Aware Vision Token Pruning for Multimodal Large Language Models

Xinhao Wang, Zhongyu Xia, Zhiwei Lin, Zhe Li, Yongtao Wang

Abstract

Multimodal Large Language Models (MLLMs) have shown strong reasoning ability, but their high computational and memory costs hinder deployment in resource-constrained settings. While Post-Training Quantization (PTQ) and vision token pruning are standard compression techniques, they are usually treated as independent optimizations. In this paper, we show that these two techniques are strongly coupled: naively applying semantic-based token pruning to PTQ-optimized MLLMs can discard activation outliers that are important for numerical stability and thus worsen quantization errors in low-bit regimes (\textit{e.g.}, W4A4). To address this issue, we propose a quantization-aware vision token pruning framework. Our method introduces a lightweight hybrid sensitivity metric that combines simulated group-wise quantization error with outlier intensity. By combining this metric with standard semantic relevance scores, the method retains tokens that are both semantically informative and robust to quantization. Experiments on standard LLaVA architectures show that our method consistently outperforms naive integration baselines. At an aggressive pruning ratio that retains only 12.5\% of visual tokens, our framework improves accuracy by 2.24\% over the baseline and even surpasses dense quantization without pruning. To the best of our knowledge, this is the first method that explicitly co-optimizes vision token pruning and PTQ for accurate low-bit MLLM inference.
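
The hybrid sensitivity metric named in the abstract has two quantization-side ingredients: a simulated group-wise quantization error and an outlier-intensity term. The numpy sketch below illustrates one plausible form of each, under stated assumptions: tokens are rows of an $(N, d)$ activation matrix, quantization is simulated with symmetric uniform fake quantization over channel groups, and the function names (`group_quant_error`, `outlier_intensity`), the sum-of-squares error, and the max/mean outlier ratio are illustrative choices rather than the paper's exact definitions.

```python
import numpy as np

def group_quant_error(tokens, bits=4, group_size=64):
    """Simulated group-wise quantization error per visual token.

    tokens: (N, d) activations, one row per visual token. Each row is
    split into channel groups of `group_size` and fake-quantized with
    b-bit symmetric uniform quantization (an assumption; the paper's
    exact simulation is not reproduced here).
    """
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit
    n, d = tokens.shape
    assert d % group_size == 0, "d must be divisible by group_size"
    x = tokens.reshape(n, d // group_size, group_size)
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0.0, 1.0, scale)      # guard all-zero groups
    x_hat = np.clip(np.round(x / scale), -qmax, qmax) * scale
    return np.square(x - x_hat).reshape(n, d).sum(axis=-1)  # per-token error

def outlier_intensity(tokens):
    """Per-token outlier score: largest channel magnitude relative to the
    token's mean magnitude (one plausible definition, not the paper's)."""
    mag = np.abs(tokens)
    return mag.max(axis=-1) / (mag.mean(axis=-1) + 1e-6)
```

Both scores can be computed in a single forward pass over the activations, which is consistent with the abstract's description of the metric as lightweight.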


Figures (3)

  • Figure 1: Motivation teaser of quantization-aware vision token pruning on a real ScienceQA sample. Panel (a) shows the input image. Panel (b) shows the token-level outlier scores, with darker red cells indicating greater quantization sensitivity. Panel (c) shows the tokens kept by semantic-only pruning, which misses the highest-scoring outlier token and leads the quantized model to predict Rhode Island. Panel (d) shows our selection, which preserves that token and recovers the correct answer South Carolina.
  • Figure 2: Overview of the proposed quantization-aware vision token pruning framework. Given input visual tokens and the query text, the model computes three complementary signals: group-wise quantization error, global outlier intensity, and semantic relevance. The first two signals are combined into a quantization sensitivity score $\mathbf{S}^{Q}$, which is further fused with the semantic pruning score $\mathbf{S}^{P}$ to produce the final score for selecting the top-$K$ visual tokens (a hedged code sketch of this fusion follows the figure list).
  • Figure 3: Normalized accuracy retention versus retained visual-token ratio for LLaVA-7B and LLaVA-13B under W4A4 PTQ. Each curve is normalized by the dense W4A4 baseline of its model. As the token budget shrinks, semantic-only pruning degrades more noticeably, while our method remains consistently closer to, and often above, the dense PTQ baseline.
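
Continuing the snippet above, the sketch below shows one plausible way to implement the fusion described in Figure 2: each signal is min-max normalized, the two quantization-side signals form $\mathbf{S}^{Q}$, and a weighted blend with the semantic score $\mathbf{S}^{P}$ ranks tokens for top-$K$ selection. The normalization, the blend weight `alpha`, and the helper `select_tokens` are assumptions for illustration, not the paper's exact fusion rule; the default `keep_ratio=0.125` mirrors the 12.5\% setting reported in the abstract.

```python
import numpy as np

def select_tokens(quant_err, outlier, semantic, keep_ratio=0.125, alpha=0.5):
    """Fuse quantization sensitivity with semantic relevance, keep top-K.

    quant_err, outlier: per-token signals (e.g., from the sketch above);
    semantic: per-token relevance score (e.g., text-to-vision attention).
    All arrays have shape (N,). `alpha` and min-max normalization are
    illustrative choices, not the paper's exact fusion rule.
    """
    def minmax(s):
        return (s - s.min()) / (s.max() - s.min() + 1e-6)

    s_q = minmax(quant_err) + minmax(outlier)                     # S^Q
    score = alpha * minmax(s_q) + (1 - alpha) * minmax(semantic)  # fuse with S^P
    k = max(1, int(round(keep_ratio * score.size)))               # token budget
    return np.sort(np.argsort(score)[-k:])                        # keep original order
```

Returning the retained indices in their original order preserves the positional layout of the surviving tokens, which matters when the pruned visual sequence is fed back into the language model.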