Vision Remember: Recovering Visual Information in Efficient LVLM with Vision Feature Resampling

Ze Feng, Jiang-jiang Liu, Sen Yang, Lingyu Xiao, Zhibin Quan, Zhenhua Feng, Wankou Yang, Jingdong Wang

TL;DR

This work tackles the efficiency-accuracy gap in LVLMs caused by dense vision tokens and information loss from vision feature compression. It introduces Vision Remember, a decoding-stage mechanism with Token-Feature Cross-Attention and Token Bidirectional Self-Attention to resample and recover original visual information across LVLM layers. Through extensive experiments on LLaVA-NeXT and other baselines, it achieves consistent improvements over prior efficient approaches, particularly in OCR and Chart understanding, and demonstrates robustness across multiple vision projectors. The results suggest Vision Remember as a general, plug-in component for building more efficient and capable LVLMs in real-world scenarios.

Abstract

The computational expense of redundant vision tokens in Large Vision-Language Models (LVLMs) has led many existing methods to compress them via a vision projector. However, this compression may lose visual information that is crucial for tasks relying on fine-grained spatial relationships, such as OCR and Chart & Table Understanding. In this paper, we propose to resample the original vision features across the LLM decoder layers to recover visual information while retaining efficiency. Following this principle, we introduce Vision Remember, which includes two key modules: (1) a Token-Feature Cross-Attention Layer and (2) a Token Bidirectional Self-Attention Layer. In the Token Bidirectional Self-Attention Layer, we employ a self-attention mechanism to maintain bidirectional interaction between the vision tokens and the text-guided token. In the Token-Feature Cross-Attention Layer, we introduce local cross-attention to resample the visual features and use multi-level fusion to enrich the visual representation. We conduct comprehensive experiments on multiple visual understanding benchmarks, and the results with the LLaVA-NeXT baseline show that Vision Remember outperforms TokenPacker by +2.7 and FastV by +5.7 across nearly all settings. Compared with previous vision feature re-fusion methods, our approach also surpasses DeepStack by +3.9 and SVA Aggregator by +3.4 on the same baseline. The experimental results validate the generalization capability of the proposed method when combined with various efficient vision projectors and LVLMs.

Paper Structure

This paper contains 17 sections, 2 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Preliminary analysis. (a) Compressing the vision tokens can cause information loss, resulting in performance degradation. The proposed Vision Remember alleviates this problem. (b) We extract vision tokens from distinct components of LVLM and evaluate the classification accuracy on Tiny-ImageNet. The compression only happens in pooling. Our analysis identifies two primary sources of visual information loss: Information Bottleneck in Token Compression and Visual Cues Forgetting in Progressive Alignment.
  • Figure 2: Overview of the proposed Vision Remember. Left part: we insert Vision Remember between the LLM decoder layers to overcome the information bottleneck in token compression and visual cues forgetting in progressive alignment. Right part: Vision Remember consists of two key components: (1) Token-Feature Cross-Attention Layer (shown in the green part) and (2) Token Bidirectional Self-Attention Layer (shown in the gray part).
  • Figure 3: Local Cross Attention. We adopt local cross attention in the Token-Feature Cross-Attention Layer to address the information bottleneck in token compression. Each vision token attends only to an $s\times s$ local region of the multi-level vision feature, which improves computational efficiency and captures fine-grained spatial information.
  • Figure 4: Efficiency comparison on an NVIDIA A100 GPU. The compression ratio is 1/4. The radius of each circle represents the GPU memory used. Vision Remember achieves the best trade-off between efficiency and accuracy.
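
The local cross attention described in the Figure 3 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a single attention head, no learned query/key/value projections, a single feature level instead of multi-level fusion, non-overlapping windows, and a residual addition of the resampled features. The function name `local_cross_attention` and all argument names are hypothetical.

```python
import numpy as np

def local_cross_attention(tokens, features, s):
    """Illustrative local cross-attention (assumed form, not the paper's code).

    tokens:   (h, w, d) compressed vision tokens on a grid, used as queries.
    features: (h*s, w*s, d) uncompressed vision features, used as keys/values.
    Each token attends only to its own s x s window of the feature map.
    """
    h, w, d = tokens.shape
    assert features.shape == (h * s, w * s, d)
    # Split the feature map into non-overlapping s x s windows: (h, w, s*s, d).
    windows = (
        features.reshape(h, s, w, s, d)
        .transpose(0, 2, 1, 3, 4)
        .reshape(h, w, s * s, d)
    )
    # Scaled dot-product scores between each token and its window: (h, w, s*s).
    scores = np.einsum("hwd,hwkd->hwk", tokens, windows) / np.sqrt(d)
    # Numerically stable softmax over the s*s positions of each window.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of window features, added residually (an assumption here).
    resampled = np.einsum("hwk,hwkd->hwd", weights, windows)
    return tokens + resampled, weights

# Example: a 2 x 3 token grid, each token covering a 2 x 2 feature window.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 3, 8))
features = rng.normal(size=(4, 6, 8))
out, wts = local_cross_attention(tokens, features, s=2)
```

Under these assumptions, each token scores only $s^2$ feature positions rather than the whole feature map, which is where the efficiency gain mentioned in the caption would come from.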