FoPru: Focal Pruning for Efficient Large Vision-Language Models

Lei Jiang, Weizhe Huang, Tongxuan Liu, Yuting Zeng, Jing Li, Lechao Cheng, Xiaohua Xu

TL;DR

Focal Pruning (FoPru) is a training-free method that prunes visual tokens according to attention-based token significance derived from the vision encoder. It removes a large number of redundant tokens while maintaining high accuracy, which significantly improves inference efficiency.

Abstract

Large Vision-Language Models (LVLMs) represent a significant advancement toward achieving superior multimodal capabilities by enabling powerful Large Language Models (LLMs) to understand visual input. Typically, LVLMs utilize visual encoders, such as CLIP, to transform images into visual tokens, which are then aligned with textual tokens through projection layers before being input into the LLM for inference. Although existing LVLMs have achieved significant success, their inference efficiency is still limited by the substantial number of visual tokens and the potential redundancy among them. To mitigate this issue, we propose Focal Pruning (FoPru), a training-free method that prunes visual tokens based on the attention-based token significance derived from the vision encoder. Specifically, we introduce two alternative pruning strategies: 1) the rank strategy, which leverages all token significance scores to retain more critical tokens in a global view; 2) the row strategy, which focuses on preserving continuous key information in images from a local perspective. Finally, the selected tokens are reordered to maintain their original positional relationships. Extensive experiments across various LVLMs and multimodal datasets demonstrate that our method can prune a large number of redundant tokens while maintaining high accuracy, leading to significant improvements in inference efficiency.
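The two pruning strategies and the final reordering step can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `rank_prune` and `row_prune`, the `keep_ratio` parameter, and the toy scores are hypothetical. The abstract does not say how significance scores are computed from the attention map, so the sketch takes them as given. The row variant shown here (keep whole rows of the patch grid, ranked by summed score) is one assumed reading of "preserving continuous key information"; the paper's exact rule may differ.

```python
from typing import List, Sequence


def rank_prune(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Global view: keep the highest-scoring tokens across the whole image."""
    n_keep = max(1, round(len(scores) * keep_ratio))
    # Order token indices by descending score; ties go to the earlier token.
    by_score = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Reorder step: return the survivors in their original positional order.
    return sorted(by_score[:n_keep])


def row_prune(scores: Sequence[float], grid_w: int, keep_ratio: float) -> List[int]:
    """Local view (assumed variant): keep entire rows of the patch grid."""
    n_rows = len(scores) // grid_w
    n_keep_rows = max(1, round(n_rows * keep_ratio))
    # Score each row by the sum of its token scores (an assumption).
    row_scores = [sum(scores[r * grid_w:(r + 1) * grid_w]) for r in range(n_rows)]
    top_rows = sorted(range(n_rows), key=lambda r: (-row_scores[r], r))[:n_keep_rows]
    kept: List[int] = []
    # Visiting rows in ascending order keeps tokens in original order.
    for r in sorted(top_rows):
        kept.extend(range(r * grid_w, (r + 1) * grid_w))
    return kept


# Toy 4x4 patch grid of made-up significance scores, flattened row by row.
scores = [
    0.1, 0.2, 0.1, 0.0,
    0.9, 0.3, 0.3, 0.3,
    0.5, 0.6, 0.7, 0.4,
    0.0, 0.1, 0.8, 0.1,
]

rank_kept = rank_prune(scores, keep_ratio=0.25)           # [4, 9, 10, 14]
row_kept = row_prune(scores, grid_w=4, keep_ratio=0.25)   # [8, 9, 10, 11]
```

With the same 25% budget, the rank variant picks scattered high-scoring tokens from three different rows, while the row variant keeps one contiguous strip. This mirrors the global versus local contrast described in the abstract. Both return indices in ascending order, so the retained tokens keep their original relative positions.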

Paper Structure

This paper contains 30 sections, 4 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: The attention map of CLIP in different layers.
  • Figure 2: The framework of Focal Pruning for LVLMs. First, we obtain the attention map in the vision encoder and calculate the token significance scores based on it. Next, we utilize alternative pruning strategies to prune the less important tokens and finally reorder the remaining tokens to recover relative positions.
  • Figure 3: The proportion of visual tokens and textual tokens in seven different datasets.
  • Figure 4: The CLIP model processes the input image (d) to generate the attention map (a), on which the token significance score is computed in FoPru. Rank and row pruning strategies are then applied, shown in (b) and (c), respectively. Figures (e) and (f) highlight the image regions selected by the rank and row strategies.
  • Figure 5: Performance metrics across visual token retention ratios for the LLaVA-1.6-7B and LLaVA-1.6-13B models on five datasets.
  • ...and 2 more figures