MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding

Yue Cao, Yangzhou Liu, Zhe Chen, Guangchen Shi, Wenhai Wang, Danhuai Zhao, Tong Lu

TL;DR

MMFuser introduces a simple, effective mechanism to enrich visual representations in multimodal LLMs by dynamically fusing multi-layer features from a single Vision Transformer. By using deep features as queries to retrieve fine-grained details from shallow layers via cross-attention and refining with self-attention, MMFuser maintains semantic alignment while enhancing detail. When integrated with LLaVA-1.5, it yields consistent improvements across a wide range of benchmarks, including OCR and region-grounding tasks, without the overhead of multiple encoders. The work demonstrates that leveraging the full spectrum of ViT features can meaningfully boost fine-grained vision-language understanding and offers an efficient, flexible path for improving MLLMs.

Abstract

Despite significant advancements in Multimodal Large Language Models (MLLMs) for understanding complex human intentions through cross-modal interactions, capturing intricate image details remains challenging. Previous methods integrating multiple vision encoders to enhance visual detail introduce redundancy and computational overhead. We observe that most MLLMs utilize only the last-layer feature map of the vision encoder for visual representation, neglecting the rich fine-grained information in shallow feature maps. To address this issue, we propose MMFuser, a simple yet effective multi-layer feature fuser that efficiently integrates deep and shallow features from Vision Transformers (ViTs). Specifically, it leverages semantically aligned deep features as queries to dynamically extract missing details from shallow features, thus preserving semantic alignment while enriching the representation with fine-grained information. Applied to the LLaVA-1.5 model, MMFuser achieves significant improvements in visual representation and benchmark performance, providing a more flexible and lightweight solution compared to multi-encoder ensemble methods. The code and model have been released at https://github.com/yuecao0119/MMFuser.
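
The query-based fusion described in the abstract can be illustrated with a minimal sketch. This is not the released implementation: it uses plain Python lists instead of tensors, a single attention head, no learned projections or normalization, and a plain residual addition, all of which are assumptions made for brevity. The names `cross_attention` and `fuse` are hypothetical. Deep-layer tokens act as queries, and tokens from several shallower layers are concatenated along the token axis to serve as keys and values.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention (no learned weights)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def fuse(deep, shallow_layers):
    """Deep tokens query the concatenated shallow tokens.

    The residual addition below is an assumption of this sketch; it keeps the
    deep features intact and adds the retrieved detail on top.
    """
    # Concatenate shallow and intermediate layers along the token axis.
    kv = [token for layer in shallow_layers for token in layer]
    detail = cross_attention(deep, kv, kv)
    return [[a + b for a, b in zip(dq, dt)] for dq, dt in zip(deep, detail)]


# Toy input: 2 tokens with 2 channels each, and two shallow layers.
deep = [[1.0, 0.0], [0.0, 1.0]]
shallow_layers = [
    [[0.5, 0.1], [0.2, 0.9]],
    [[0.8, 0.3], [0.1, 0.7]],
]
fused = fuse(deep, shallow_layers)
print(len(fused), len(fused[0]))  # 2 2
```

Because the deep features supply the queries, the output keeps the same token count and channel width as the deep feature map, so it could be passed to the same projector that would otherwise receive the deep features alone.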

Paper Structure

This paper contains 30 sections, 3 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Comparison of feature maps from different vision encoders and various layers of CLIP-ViT. (a) Cosine similarity is computed between the feature maps from various vision encoders, including CLIP-ViT-L [radford2021clip], ConvNeXt-XXL [liu2022convnet], DINOv2-L [oquab2023dinov2], EVA02-L [fang2023eva02], and SigLIP-L [zhai2023sigmoid], and the final-layer feature map of CLIP-ViT-L. (b) Visualization of different feature maps. These results indicate significant feature differences not only between various vision encoders but also across different layers within the same vision encoder. This observation motivates us to fully explore the potential of individual vision encoders for developing MLLMs.
  • Figure 2: Performance comparison across different model sizes. (a) Among 7B models, including Qwen-VL-Chat [bai2023qwenvl] and LLaVA-1.5-7B [liu2023llava_1_5], our model surpasses LLaVA-1.5-7B on 11 out of 12 benchmarks, with an average score of 61.8 compared to LLaVA-1.5-7B's 60.3. (b) Among 13B models, including InstructBLIP [instructblip] and LLaVA-1.5-13B [liu2023llava_1_5], our model also outperforms LLaVA-1.5-13B on 10 out of 12 benchmarks, achieving an average score of 64.1 compared to LLaVA-1.5-13B's 63.2. These results indicate that MMFuser can effectively improve the performance of LLaVA-1.5 models.
  • Figure 3: Previous methods vs. the proposed MMFuser. (a) Previous methods typically utilize visual features from the final or penultimate layer of the vision encoder. For example, the LLaVA series [liu2023llava, liu2023llava_1_5] adopted this approach. (b) Some models integrate visual features from multiple vision encoders, such as MouSi [fan2024mousi], DeepSeek-VL [lu2024deepseek], and LLaVA-HR [luo2024feast]. (c) Our MMFuser fuses visual features from different layers of a single vision encoder, providing richer detail and better semantic alignment with text.
  • Figure 4: Overview of MMFuser. In MMFuser, feature maps from different layers of the vision encoder are strategically integrated to enhance the visual representations. Deep feature maps are employed as query elements, while shallow and intermediate feature maps are concatenated to form key and value elements. Through a dynamic attention-based fusion, MMFuser combines fine-grained details and higher-level semantic information. The fused features are then aligned with text using a projector and subsequently passed as inputs to LLMs.
  • Figure 5: Feature map visualization of MMFuser. For each image, we provide three types of output feature maps. The term "Key/Value" refers to the averaged feature maps from four selected shallow and intermediate layers of the ViT—specifically, layers 3, 8, 13, and 18—used as the key and value inputs in MMFuser. "Query" denotes the feature map from the penultimate layer of the ViT, serving as the query input in MMFuser and as the visual representations in prior MLLMs. "MMFuser Output" represents the feature map generated after applying the proposed MMFuser. As can be seen, the proposed MMFuser captures fine-grained details from shallow and intermediate ViT layers, enriching the visual representations for the LLM.
  • ...and 1 more figure
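
The cosine-similarity comparison mentioned for Figure 1(a) can be sketched as follows. How the paper aggregates similarity over a feature map is not stated in this section, so averaging per-token similarities is an assumption of this sketch, as is the requirement that both maps share the same token count and channel width. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_token_similarity(map_a, map_b):
    """Average per-token cosine similarity between two feature maps.

    Assumption: both maps are lists of tokens with matching shapes.
    """
    sims = [cosine_similarity(ta, tb) for ta, tb in zip(map_a, map_b)]
    return sum(sims) / len(sims)


# Toy feature maps: 2 tokens with 2 channels each.
final_layer = [[1.0, 0.0], [0.0, 1.0]]
other_layer = [[0.0, 1.0], [0.0, 2.0]]
score = mean_token_similarity(final_layer, other_layer)
print(score)
```

In the toy input, the first token pair is orthogonal and the second is parallel, so the score falls halfway between 0 and 1. A low score between two layers of the same encoder is the kind of difference the figure caption points to.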