Dense Connector for MLLMs

Huanjin Yao; Wenhao Wu; Taojiannan Yang; YuXin Song; Mengxi Zhang; Haocheng Feng; Yifan Sun; Zhiheng Li; Wanli Ouyang; Jingdong Wang

TL;DR

Dense Connector tackles the underutilization of visual signals in MLLMs by fusing multi-layer features from frozen vision encoders into the LLM via three lightweight instantiations: Sparse Token Integration (STI), Sparse Channel Integration (SCI), and Dense Channel Integration (DCI). An Efficient Dense Connector reduces the visual token count with minimal accuracy loss, and a training-free extension enables video understanding. Across multiple vision backbones, image resolutions, training data scales, and LLM sizes (2.7B to 70B), Dense Connector improves existing MLLMs and achieves state-of-the-art results on 19 image and video benchmarks, including competitive video performance without any video training. The work demonstrates the practical viability and wide applicability of dense multi-layer visual integration for improving cross-modal reasoning in MLLMs.
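As a rough illustration of what fusing multi-layer features can look like, the sketch below concatenates hidden states from a few encoder layers along the channel dimension and then applies a single linear projection, in the spirit of the channel-integration variants. The function name `sparse_channel_integration`, the layer indices, the tiny tensor sizes, and the random weights are all assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def sparse_channel_integration(hidden_states, layer_ids, proj_weight):
    """Concatenate selected encoder layers along channels, then project.

    hidden_states: list of (N, D) arrays, one per encoder layer.
    layer_ids:     indices of the layers to fuse (hypothetical choice).
    proj_weight:   (len(layer_ids) * D, D_llm) projection matrix.
    """
    selected = [hidden_states[i] for i in layer_ids]
    fused = np.concatenate(selected, axis=-1)  # (N, len(layer_ids) * D)
    return fused @ proj_weight                 # (N, D_llm)

rng = np.random.default_rng(0)
N, D, D_llm, num_layers = 16, 8, 12, 6         # toy sizes, not real model dims
hidden_states = [rng.standard_normal((N, D)) for _ in range(num_layers)]
layer_ids = [1, 3, 5]                          # hypothetical: shallow, middle, final
proj_weight = rng.standard_normal((len(layer_ids) * D, D_llm)) * 0.01

visual_tokens = sparse_channel_integration(hidden_states, layer_ids, proj_weight)
print(visual_tokens.shape)  # (16, 12)
```

Because the fusion happens along the channel axis, the number of visual tokens passed to the LLM stays at `N`; only the input width of the projection grows.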

Abstract

Do we fully leverage the potential of the visual encoder in Multimodal Large Language Models (MLLMs)? The recent outstanding performance of MLLMs in multimodal understanding has garnered broad attention from both academia and industry. In the current MLLM rat race, the focus seems to be predominantly on the linguistic side. We witness the rise of larger and higher-quality instruction datasets, as well as the involvement of larger LLMs. Yet scant attention has been directed towards the visual signals utilized by MLLMs, which are often assumed to be the final high-level features extracted by a frozen visual encoder. In this paper, we introduce the Dense Connector - a simple, effective, and plug-and-play vision-language connector that significantly enhances existing MLLMs by leveraging multi-layer visual features, with minimal additional computational overhead. Building on this, we also propose the Efficient Dense Connector, which achieves performance comparable to LLaVA-v1.5 with only 25% of the visual tokens. Furthermore, our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well. Experimental results across various vision encoders, image resolutions, training dataset scales, LLMs of varying sizes (2.7B to 70B), and diverse MLLM architectures (e.g., LLaVA-v1.5, LLaVA-NeXT, and Mini-Gemini) validate the versatility and scalability of our approach, achieving state-of-the-art performance across 19 image and video benchmarks. We hope that this work will provide valuable experience and serve as a basic module for future MLLM development. Code is available at https://github.com/HJYao00/DenseConnector .
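To make the "25% of the visual tokens" figure concrete, the short calculation below assumes the common LLaVA-v1.5 configuration of a 336-pixel input with 14-pixel patches, and assumes the reduction comes from 2x2 spatial pooling of the token grid. Both the helper `num_visual_tokens` and the pooling factor are illustrative assumptions; the abstract itself states only the 25% ratio.

```python
def num_visual_tokens(image_size, patch_size, pool=1):
    """Count tokens on a square patch grid after optional pool x pool downsampling."""
    side = image_size // patch_size
    if side % pool != 0:
        raise ValueError("grid side must be divisible by the pooling factor")
    return (side // pool) ** 2

baseline = num_visual_tokens(336, 14)          # 24 x 24 grid
reduced = num_visual_tokens(336, 14, pool=2)   # 12 x 12 grid after 2x2 pooling
print(baseline, reduced, reduced / baseline)   # 576 144 0.25
```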

Paper Structure

This paper contains 21 sections, 8 equations, 13 figures, 8 tables.

Introduction
Related Work
Large Pre-trained Vision Models
Large Language Models
Multimodal Large Language Models
Method
Overview
Dense Connector
Efficient Dense Connector for Visual Token Optimization
Training-Free Extension from Image to Video Conversational Models
Experiments
Implementation Details
Ablation Study
Main Results
Conclusion and Limitation
...and 6 more sections

Figures (13)

Figure 1: Exploring Multi-layer Visual Features Empowering existing MLLMs.
Figure 2: Dense Connector in MLLM: Overview and Three Instantiations. $N$ is the number of tokens, $D$ is the feature dimension, and $\alpha$ is the downsampling ratio.
Figure 3: Quantitative Results for Image and Video dialogues. Figures (a) through (d) pertain to image understanding, while figures (e) and (f) relate to video understanding.
Figure 4: Comparison of three instantiations of Dense Connector with LLaVA-1.5. STI stands for Sparse Token Integration, SCI for Sparse Channel Integration and DCI for Dense Channel Integration.
Figure 5: Qualitative results of the flowchart understanding.
...and 8 more figures
