CATP: Cross-Attention Token Pruning for Accuracy Preserved Multimodal Model Inference
Ruqi Liao, Chuqing Zhao, Jin Li, Weiqi Feng
TL;DR
This paper targets the high inference cost of large multimodal models such as BLIP-2 by introducing Cross-Attention Token Pruning (CATP), which uses cross-attention signals from the Q-Former to rank query tokens for pruning. CATP aggregates votes across multiple heads and layers into a robust token importance score, enabling end-to-end post-training pruning with minimal accuracy loss. On a 10% subset of VQA, CATP achieves up to 12.1x higher accuracy than self-attention-based pruning baselines and substantial improvements over L2-norm pruning, with further gains from image-token weighting and layer-importance analyses. The results indicate that cross-attention cues can preserve multimodal task performance while reducing computational cost, which is of practical value when deploying large multimodal systems.
Abstract
In response to the rising interest in large multimodal models, we introduce Cross-Attention Token Pruning (CATP), a precision-focused token pruning method. Our approach leverages the cross-attention layers in multimodal models, exemplified by BLIP-2, to extract information for determining token importance. CATP employs a refined voting strategy across model heads and layers. In evaluations, CATP achieves up to 12.1x higher accuracy than existing token pruning methods, addressing the trade-off between computational efficiency and model precision.
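
The voting idea across heads and layers can be illustrated with a short plain-Python sketch. It is a minimal illustration under stated assumptions, not the authors' implementation. It assumes the cross-attention probabilities are available as a nested list indexed `attn[layer][head][query][image_token]`, and that each image token awards rank-based points to the query tokens that attend to it (a Borda-style rule chosen here for illustration). The function names `vote_scores` and `keep_top` are hypothetical.

```python
def vote_scores(attn):
    """Accumulate rank-based votes for each query token.

    attn[layer][head][query][image_token] holds cross-attention
    probabilities (assumed layout). For every layer, head, and image
    token, queries are ranked by attention probability; the weakest
    query receives 0 points and the strongest receives num_queries - 1.
    """
    num_queries = len(attn[0][0])
    scores = [0] * num_queries
    for layer in attn:
        for head in layer:
            num_image_tokens = len(head[0])
            for j in range(num_image_tokens):
                # Attention that each query pays to image token j.
                column = [head[q][j] for q in range(num_queries)]
                # Query indices sorted from weakest to strongest.
                order = sorted(range(num_queries), key=lambda q: column[q])
                for points, q in enumerate(order):
                    scores[q] += points
    return scores


def keep_top(scores, num_keep):
    """Return the indices of the num_keep highest-scoring queries, in order."""
    ranked = sorted(range(len(scores)), key=lambda q: scores[q], reverse=True)
    return sorted(ranked[:num_keep])


# Toy input: 1 layer, 2 heads, 3 query tokens, 3 image tokens.
attn = [[
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8]],
    [[0.5, 0.4, 0.1], [0.3, 0.5, 0.2], [0.2, 0.2, 0.6]],
]]
scores = vote_scores(attn)    # [6, 8, 4]
kept = keep_top(scores, 2)    # [0, 1]
```

In this toy case the third query token collects the fewest votes and is the one pruned. Summing ranks rather than raw probabilities is one way to keep a single head with unusually peaked attention from dominating the score; whether CATP uses exactly this point scheme is not stated in the text above.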
