FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression

Yuke Zhu, Chi Xie, Shuang Liang, Bo Zheng, Sheng Guo

TL;DR

This work builds a coarse-to-fine visual token compression method, with a vision-guided sampler for compressing redundant regions with low information density, and a text-guided sampler for selecting visual tokens that are strongly correlated with the user instructions.

Abstract

Recent advances in Multi-modal Large Language Models have demonstrated that high-resolution image input is crucial for model capabilities, especially for fine-grained tasks. However, high-resolution images lead to a quadratic increase in the number of visual tokens input into LLMs, resulting in significant computational costs. Current work develops visual token compression methods to achieve efficiency improvements, often at the expense of performance. We argue that removing visual redundancy can simultaneously improve both efficiency and performance. We build a coarse-to-fine visual token compression method, with a vision-guided sampler for compressing redundant regions with low information density, and a text-guided sampler for selecting visual tokens that are strongly correlated with the user instructions. With these two modules, the proposed FocusLLaVA achieves improvements in both efficiency and performance. We validate the effectiveness of our approach on a wide range of evaluation datasets.

Paper Structure

This paper contains 16 sections, 7 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: (a) Overall structure of FocusLLaVA. The two core modules are the vision-guided sampler and the text-guided sampler. The features from each sub-image are first concatenated into a whole and then partitioned by region, each region forming a local feature block. Each block is then processed by the vision-guided sampler. (b) The structure of the vision-guided sampler. It takes a feature block and the global image's features as inputs and outputs the predicted sampling scale for this region. (c) The structure of the text-guided sampler. It aggregates the multi-head attention scores to form the importance map of visual tokens.
  • Figure 2: Performance and speed with textual guidance in different layers.
  • Figure 3: Statistics of multi-scale sampling. For each sample, the proportion of the three types of max-pooling is shown.
  • Figure 4: Heatmaps of the areas selected by the vision-guided sampler and the text-guided sampler. For each image set, from left to right: the original image, the heatmap of the vision-guided sampler, the heatmap of the text-guided sampler.
  • Figure 5: Importance maps from different layers of the LLM. There are 32 layers in total. We show the importance map from every second layer. The maps are arranged in reading order.
  • ...and 5 more figures
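
The Figure 1(c) caption says the text-guided sampler aggregates multi-head attention scores into an importance map over visual tokens. The following is a minimal sketch of that idea, not the authors' implementation. The function names (`importance_map`, `select_visual_tokens`), the choice of averaging over heads and over text tokens, and the `keep_ratio` parameter are all assumptions made for illustration; the page does not specify how the scores are aggregated or how many tokens are kept.

```python
import numpy as np


def importance_map(attn):
    """Collapse text-to-visual attention into one score per visual token.

    attn has shape (num_heads, num_text_tokens, num_visual_tokens).
    Assumption: heads and text tokens are both reduced by a plain mean.
    """
    attn = np.asarray(attn, dtype=np.float64)
    if attn.ndim != 3:
        raise ValueError("attn must have shape (heads, text, visual)")
    per_text = attn.mean(axis=0)   # average over heads -> (text, visual)
    return per_text.mean(axis=0)   # average over text tokens -> (visual,)


def select_visual_tokens(visual_tokens, attn, keep_ratio=0.5):
    """Keep the highest-scoring visual tokens, preserving their original order.

    keep_ratio is a hypothetical knob, not a value taken from the paper.
    """
    scores = importance_map(attn)
    num_tokens = scores.shape[0]
    k = max(1, int(round(num_tokens * keep_ratio)))
    top = np.argpartition(scores, -k)[-k:]  # indices of the k largest scores
    keep = np.sort(top)                     # restore spatial/sequence order
    return visual_tokens[keep], keep


# Toy example: 2 heads, 2 text tokens, 4 visual tokens of dimension 2.
toy_attn = np.zeros((2, 2, 4))
toy_attn[:, :, 0] = 0.1
toy_attn[:, :, 1] = 0.6
toy_attn[:, :, 3] = 0.3
toy_tokens = np.arange(8).reshape(4, 2)
kept_tokens, kept_idx = select_visual_tokens(toy_tokens, toy_attn, keep_ratio=0.5)
```

Sorting the kept indices keeps the surviving tokens in their original sequence order, so any positional information carried by that order is not scrambled by the selection step.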