Streamline Without Sacrifice -- Squeeze out Computation Redundancy in LMM

Penghao Wu, Lewei Lu, Ziwei Liu

TL;DR

This paper identifies computation-level redundancy on vision tokens in decoder-only large multimodal models (LMMs) and proposes ProxyV, in which a small set of proxy vision tokens takes over the heavy computation while the full vision tokens are updated through lightweight guided modules. The full vision tokens are downsampled into proxies, and the information the proxies gather is propagated back to the full tokens; this reduces prefilling FLOPs and time with no performance loss, and with gains on several backbones. A non-spatial variant allows ProxyV to be combined with token-reduction methods. The approach is validated across multiple LLM backbones and benchmarks, where it preserves fine-grained visual understanding, particularly on tasks that require dense visual grounding.

Abstract

Large multimodal models excel in multimodal tasks but face significant computational challenges due to excessive computation on visual tokens. Unlike token reduction methods that focus on token-level redundancy, we identify and study the computation-level redundancy on vision tokens to ensure no information loss. Our key insight is that vision tokens from the pretrained vision encoder do not necessarily require all the heavy operations (e.g., self-attention, FFNs) in decoder-only LMMs and could be processed more lightly with proper designs. We designed a series of experiments to discover and progressively squeeze out the vision-related computation redundancy. Based on our findings, we propose ProxyV, a novel approach that utilizes proxy vision tokens to alleviate the computational burden on original vision tokens. ProxyV enhances efficiency without compromising performance and can even yield notable performance gains in scenarios with more moderate efficiency improvements. Furthermore, the flexibility of ProxyV is demonstrated through its combination with token reduction methods to boost efficiency further. The code will be made public at https://github.com/penghao-wu/ProxyV.

Paper Structure

This paper contains 13 sections, 5 figures, 10 tables.

Figures (5)

  • Figure 1: ProxyV retains or increases the fine-grained benchmark performance while effectively reducing the computational cost. ProxyV-L12 and ProxyV-L16 denote applying ProxyV from layers 12 and 16, respectively.
  • Figure 2: The relative $\mathrm{Score}_{\mathrm{fine}}$ under different vision attention masking ratios for different LLMs. The computation redundancy begins in the middle-to-rear part of the LLMs, since masking the vision attention there does not affect performance.
  • Figure 3: Left: the vanilla LMM structure where full vision tokens cause significant computation. Right: the overall pipeline of the proposed ProxyV algorithm. The full vision tokens are first downsampled to obtain a much smaller version that works as proxy vision tokens. The proxy vision tokens participate in the original operations in the decoder layer including the self-attention and the FFNs to obtain useful information at a much lower cost. After this, each original vision token is guided by its spatially corresponding proxy vision token for an update through a lightweight MLP.
  • Figure 4: Illustration of the non-spatial ProxyV. Upper part: proxy vision tokens are generated as a weighted combination of full vision tokens through a simple attention operation. Lower part: the attention scores from the upper part are reused to splat the proxy vision tokens into guidance for updating the full vision tokens. The softmax operations are omitted from the figure.
  • Figure 5: Cases where token reduction methods fail. Left: Token reduction methods fail to extract the complete dense information accurately. Right: Token reduction methods fail to retain critical visual information when the image contains diverse and dense visual details. In these cases, ProxyV retains all the visual information and successfully extracts the important visual details.
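
The pipelines described in the captions of Figures 3 and 4 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The captions do not specify the downsampling operator, the exact form of the lightweight MLP, or which tokens serve as keys and values in attention, so the sketch assumes average pooling, a two-layer residual MLP over concatenated (token, guidance) pairs, and attention restricted to proxy and text tokens without a causal mask. All function names, the `queries` matrix in the non-spatial variant, and the weights `w1` and `w2` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def downsample(vision, grid, r):
    """Average-pool a (grid*grid, d) row-major token grid by factor r (assumed operator)."""
    g, d = grid // r, vision.shape[1]
    return vision.reshape(g, r, g, r, d).mean(axis=(1, 3)).reshape(g * g, d)

def upsample(proxy, grid, r):
    """Give every full token a copy of its spatially corresponding proxy."""
    g, d = grid // r, proxy.shape[1]
    x = np.broadcast_to(proxy.reshape(g, 1, g, 1, d), (g, r, g, r, d))
    return x.reshape(grid * grid, d)

def heavy_layer(tokens):
    """Stand-in for a decoder layer: one-head self-attention plus residual.
    A real layer would add projections, an FFN, and causal masking."""
    d = tokens.shape[1]
    attn = softmax(tokens @ tokens.T / np.sqrt(d), axis=1)
    return tokens + attn @ tokens

def light_update(full, guide, w1, w2):
    """Assumed lightweight module: residual two-layer MLP on [token, guidance]."""
    hidden = np.maximum(np.concatenate([full, guide], axis=1) @ w1, 0.0)
    return full + hidden @ w2

def proxyv_layer(vision, text, grid, r, w1, w2):
    """Spatial variant (Figure 3): only proxies and text pass through the heavy layer."""
    proxy = downsample(vision, grid, r)
    n_proxy = proxy.shape[0]
    out = heavy_layer(np.concatenate([proxy, text], axis=0))
    proxy, text = out[:n_proxy], out[n_proxy:]
    vision = light_update(vision, upsample(proxy, grid, r), w1, w2)
    return vision, text

def nonspatial_proxies(vision, queries):
    """Non-spatial variant (Figure 4, upper): proxies as attention-weighted
    combinations of full tokens. `queries` is a hypothetical (n_proxy, d) matrix."""
    scores = queries @ vision.T / np.sqrt(vision.shape[1])  # (n_proxy, n_full)
    return softmax(scores, axis=1) @ vision, scores

def nonspatial_guidance(proxy, scores):
    """Figure 4, lower: reuse the same scores, normalized over proxies,
    to splat proxies back into per-token guidance of shape (n_full, d)."""
    return softmax(scores.T, axis=1) @ proxy

# Usage with random data: a 4x4 grid of vision tokens, 3 text tokens, width 8.
rng = np.random.default_rng(0)
grid, r, d = 4, 2, 8
vision = rng.normal(size=(grid * grid, d))
text = rng.normal(size=(3, d))
w1 = 0.1 * rng.normal(size=(2 * d, d))
w2 = 0.1 * rng.normal(size=(d, d))
new_vision, new_text = proxyv_layer(vision, text, grid, r, w1, w2)
```

In this sketch the quadratic attention cost applies to 4 proxies plus 3 text tokens instead of 16 full vision tokens plus text, while every full token is still updated, which mirrors the caption's point that no vision token is discarded.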