Streamline Without Sacrifice -- Squeeze out Computation Redundancy in LMM
Penghao Wu, Lewei Lu, Ziwei Liu
TL;DR
This paper identifies computation-level redundancy on vision tokens in decoder-only large multimodal models (LMMs) and proposes ProxyV, which lets a small set of proxy vision tokens shoulder the heavy computations while the full vision tokens are updated through lightweight guided modules. The proxies are obtained by downsampling the vision tokens, and their information is propagated back to the full tokens. ProxyV significantly reduces prefilling FLOPs and time with no performance loss, and even gains, on several backbones, and a non-spatial variant allows it to be combined with token-reduction approaches. The approach is validated across multiple LLM backbones and benchmarks, where it preserves fine-grained visual understanding while improving efficiency, especially on tasks requiring dense visual grounding.
Abstract
Large multimodal models excel in multimodal tasks but face significant computational challenges due to excessive computation on visual tokens. Unlike token reduction methods that focus on token-level redundancy, we identify and study the computation-level redundancy on vision tokens to ensure no information loss. Our key insight is that vision tokens from the pretrained vision encoder do not necessarily require all the heavy operations (e.g., self-attention, FFNs) in decoder-only LMMs and could be processed more lightly with proper designs. We designed a series of experiments to discover and progressively squeeze out the vision-related computation redundancy. Based on our findings, we propose ProxyV, a novel approach that utilizes proxy vision tokens to alleviate the computational burden on original vision tokens. ProxyV enhances efficiency without compromising performance and can even yield notable performance gains in scenarios with more moderate efficiency improvements. Furthermore, the flexibility of ProxyV is demonstrated through its combination with token reduction methods to boost efficiency further. The code will be made public at https://github.com/penghao-wu/ProxyV.
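To make the proxy-token idea concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes the vision tokens lie on an h x w grid with h and w divisible by the downsampling factor r. The function names, the average-pooling downsampler, the unparameterized non-causal attention used as a stand-in for the heavy decoder operations, and the fixed blending weight alpha are all hypothetical choices made for illustration only.

```python
import math

def downsample(tokens, h, w, r):
    """Average-pool an h x w grid of tokens into (h//r) x (w//r) proxy tokens (assumed scheme)."""
    d = len(tokens[0])
    proxies = []
    for pi in range(h // r):
        for pj in range(w // r):
            window = [tokens[(pi * r + di) * w + (pj * r + dj)]
                      for di in range(r) for dj in range(r)]
            proxies.append([sum(t[k] for t in window) / len(window) for k in range(d)])
    return proxies

def heavy_layer(seq):
    """Stand-in for the expensive ops: parameter-free softmax self-attention plus residual."""
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        mixed = [sum(wt * v[k] for wt, v in zip(weights, seq)) / z for k in range(d)]
        out.append([a + b for a, b in zip(q, mixed)])
    return out

def guided_update(tokens, proxies, h, w, r, alpha=0.5):
    """Cheap per-token update: blend each full token with the proxy covering its window.
    A learned lightweight module would replace this fixed blend (alpha is hypothetical)."""
    pw = w // r
    updated = []
    for idx, tok in enumerate(tokens):
        i, j = divmod(idx, w)
        proxy = proxies[(i // r) * pw + (j // r)]
        updated.append([(1 - alpha) * a + alpha * b for a, b in zip(tok, proxy)])
    return updated

def proxy_layer(vision, text, h, w, r=2):
    """One layer: only proxies and text tokens pass through the heavy computation."""
    proxies = downsample(vision, h, w, r)
    mixed = heavy_layer(proxies + text)
    new_proxies, new_text = mixed[:len(proxies)], mixed[len(proxies):]
    new_vision = guided_update(vision, new_proxies, h, w, r)
    return new_vision, new_text

# Toy input: a 4 x 4 grid of 2-dimensional vision tokens and two text tokens.
vision = [[float(i), 1.0] for i in range(16)]
text = [[0.0, 1.0], [1.0, 0.0]]
new_vision, new_text = proxy_layer(vision, text, h=4, w=4, r=2)
```

In this toy setup the heavy layer processes 4 proxies plus 2 text tokens rather than all 16 vision tokens, while every one of the 16 full tokens is still retained and updated, which is the sense in which no vision token is discarded.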
