MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding
Yue Cao, Yangzhou Liu, Zhe Chen, Guangchen Shi, Wenhai Wang, Danhuai Zhao, Tong Lu
TL;DR
MMFuser introduces a simple, effective mechanism to enrich visual representations in multimodal LLMs by dynamically fusing multi-layer features from a single Vision Transformer. It uses deep features as queries to retrieve fine-grained details from shallow layers via cross-attention, then refines the result with self-attention, so semantic alignment is preserved while detail is added. When integrated with LLaVA-1.5, it yields consistent improvements across a wide range of benchmarks, including OCR and region-grounding tasks, without the overhead of multiple encoders. The work demonstrates that leveraging the full spectrum of ViT features can meaningfully boost fine-grained vision-language understanding and offers an efficient, flexible path for improving MLLMs.
Abstract
Despite significant advancements in Multimodal Large Language Models (MLLMs) for understanding complex human intentions through cross-modal interactions, capturing intricate image details remains challenging. Previous methods integrating multiple vision encoders to enhance visual detail introduce redundancy and computational overhead. We observe that most MLLMs utilize only the last-layer feature map of the vision encoder for visual representation, neglecting the rich fine-grained information in shallow feature maps. To address this issue, we propose MMFuser, a simple yet effective multi-layer feature fuser that efficiently integrates deep and shallow features from Vision Transformers (ViTs). Specifically, it leverages semantically aligned deep features as queries to dynamically extract missing details from shallow features, thus preserving semantic alignment while enriching the representation with fine-grained information. Applied to the LLaVA-1.5 model, MMFuser achieves significant improvements in visual representation and benchmark performance, providing a more flexible and lightweight solution compared to multi-encoder ensemble methods. The code and model have been released at https://github.com/yuecao0119/MMFuser.
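To make the fusion idea concrete, the following is a minimal, dependency-free sketch of the pattern described above: deep tokens act as queries over shallow tokens (cross-attention), the result is refined with self-attention, and the outcome is added back to the deep features. It is an illustration under stated assumptions, not the released implementation. The names `attention` and `fuse_multilayer`, the scalar `gate`, and the residual wiring are hypothetical, and learned projections, normalization, and multi-head structure are deliberately omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Plain scaled dot-product attention over lists of token vectors.

    No learned projections are applied; this is a simplification.
    """
    scale = math.sqrt(len(queries[0]))
    dim = len(values[0])
    output = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        output.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return output


def add(xs, ys):
    """Token-wise sum of two equally shaped token lists."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def fuse_multilayer(deep, shallow_layers, gate=0.5):
    """Hypothetical fuser: deep tokens query shallow tokens, then refine.

    deep:           tokens from the last ViT layer, shape [N][D]
    shallow_layers: list of token lists from earlier layers, each [N][D]
    gate:           assumed scalar weight on the fused detail branch
    """
    # Pool tokens from all selected shallow layers into one key/value set.
    shallow = [token for layer in shallow_layers for token in layer]
    # Cross-attention: deep features are the queries, shallow the keys/values.
    detail = add(deep, attention(deep, shallow, shallow))
    # Self-attention refinement with a residual connection.
    detail = add(detail, attention(detail, detail, detail))
    # Add the gated detail branch back onto the deep features.
    return [
        [d + gate * x for d, x in zip(deep_tok, detail_tok)]
        for deep_tok, detail_tok in zip(deep, detail)
    ]


if __name__ == "__main__":
    deep = [[1.0, 0.0], [0.0, 1.0]]
    shallow_layers = [
        [[0.5, 0.5], [1.0, -1.0]],
        [[0.0, 2.0], [2.0, 0.0]],
    ]
    fused = fuse_multilayer(deep, shallow_layers)
    print(len(fused), len(fused[0]))  # token count and width match `deep`
```

Because the deep tokens are the queries, the output keeps the same token count and width as the last-layer feature map, so in this sketch the fused features could be passed to a downstream projector unchanged. Setting `gate` to zero recovers the deep features exactly, which is one simple way to keep the original representation as a fallback.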
