Vision Remember: Recovering Visual Information in Efficient LVLM with Vision Feature Resampling
Ze Feng, Jiang-jiang Liu, Sen Yang, Lingyu Xiao, Zhibin Quan, Zhenhua Feng, Wankou Yang, Jingdong Wang
TL;DR
This work tackles the efficiency-accuracy gap in LVLMs: dense vision tokens are expensive to process, while compressing them in the vision projector loses visual information. It introduces Vision Remember, a mechanism applied across the LLM decoder layers that uses Token-Feature Cross-Attention and Token Bidirectional Self-Attention to resample the original vision features and recover the lost information. In extensive experiments on LLaVA-NeXT and other baselines, it achieves consistent improvements over prior efficient approaches, particularly in OCR and chart understanding, and remains effective across multiple vision projectors. The results suggest that Vision Remember can serve as a general, plug-in component for building more efficient and capable LVLMs in real-world scenarios.
Abstract
The computational expense of redundant vision tokens in Large Vision-Language Models (LVLMs) has led many existing methods to compress them via a vision projector. However, this compression may lose visual information that is crucial for tasks relying on fine-grained spatial relationships, such as OCR and Chart & Table Understanding. In this paper, we propose to resample the original vision features across the LLM decoder layers to recover visual information while preserving efficiency. Following this principle, we introduce Vision Remember, which includes two key modules: (1) a Token-Feature Cross-Attention Layer and (2) a Token Bidirectional Self-Attention Layer. In the Token-Feature Cross-Attention Layer, we introduce local cross-attention to resample the visual features and use multi-level fusion to enrich the visual representation. In the Token Bidirectional Self-Attention Layer, we employ a self-attention mechanism to maintain bidirectional interaction between the vision tokens and the text-guided token. We conduct comprehensive experiments on multiple visual understanding benchmarks, and the results with the LLaVA-NeXT baseline show that Vision Remember outperforms TokenPacker by +2.7 and FastV by +5.7 across nearly all settings. Compared with previous vision feature re-fusion methods, our approach also surpasses DeepStack by +3.9 and SVA Aggregator by +3.4 on the same baseline. The experimental results validate the generalization capability of the proposed method when combined with various efficient vision projectors and LVLMs.
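To make the idea of local cross-attention resampling concrete, the following is a minimal sketch and not the paper's implementation. It assumes that each compressed vision token sits on a coarse grid, that the original vision features sit on a finer grid, and that each token attends only to the scale x scale window of original features it covers. The function name local_cross_attention, the window definition, the residual addition, and the absence of learned query/key/value projections are all illustrative assumptions; multi-level fusion and the bidirectional self-attention layer are not modeled.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def local_cross_attention(tokens, features, grid, scale):
    """Resample original features into compressed tokens (illustrative only).

    tokens:   grid * grid vectors in row-major order (compressed vision tokens)
    features: (grid * scale) ** 2 vectors in row-major order (original features)
    Assumption: no learned projections; queries, keys and values are the raw vectors.
    """
    fine = grid * scale
    dim = len(tokens[0])
    out = []
    for idx, query in enumerate(tokens):
        row, col = divmod(idx, grid)
        # Gather the scale x scale window of original features under this token.
        window = [
            features[(row * scale + dr) * fine + (col * scale + dc)]
            for dr in range(scale)
            for dc in range(scale)
        ]
        # Scaled dot-product attention restricted to the local window.
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in window])
        attended = [
            sum(w * value[d] for w, value in zip(weights, window))
            for d in range(dim)
        ]
        # Residual update: the token keeps its content and adds resampled detail.
        out.append([q + a for q, a in zip(query, attended)])
    return out


# Usage: a 2 x 2 token grid resampling a 4 x 4 feature map with 2-dim features.
example_features = [[float(i), float(j)] for i in range(4) for j in range(4)]
example_tokens = [[0.0, 0.0] for _ in range(4)]
resampled = local_cross_attention(example_tokens, example_features, grid=2, scale=2)
```

Restricting each token to its own window keeps the cost linear in the number of original features, which is one plausible reason to prefer local over global cross-attention in this setting.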
