To Preserve or To Compress: An In-Depth Study of Connector Selection in Multimodal Large Language Models
Junyan Lin, Haoran Chen, Dawei Zhu, Xiaoyu Shen
TL;DR
This study systematically evaluates how connectors between vision encoders and language models affect multimodal LLM performance across coarse-grained perception, fine-grained perception, and reasoning tasks. By classifying connectors as feature-preserving or feature-compressing and reclassifying the sub-tasks of MMBench, MME, and SEED-Bench, the authors show that feature-preserving connectors excel in fine-grained perception, while feature-compressing connectors offer speed advantages with competitive results on the other task types. The work further dissects pooling strategies (average pooling, attention pooling, convolutional mapping) and shows how image resolution and token count mediate these effects, offering concrete recommendations for connector selection under resource constraints. Overall, the findings guide architecture design choices in MLLMs by balancing task demands against computational budgets, and highlight that higher resolutions diminish the relative gains of preserving features while increasing training costs. In the connector design, $P$ and $Q$ denote patch counts, which makes explicit the fundamental trade-off between information retention and efficiency across perceptual granularities.
Abstract
In recent years, multimodal large language models (MLLMs) have garnered significant attention from both industry and academia. However, there is still considerable debate over how to construct MLLM architectures, particularly regarding the selection of appropriate connectors for perception tasks of varying granularities. This paper systematically investigates the impact of connectors on MLLM performance. Specifically, we classify connectors into feature-preserving and feature-compressing types. Using a unified classification standard, we categorize sub-tasks from three comprehensive benchmarks, MMBench, MME, and SEED-Bench, into three task types (coarse-grained perception, fine-grained perception, and reasoning) and evaluate connector performance on each type. Our findings reveal that feature-preserving connectors excel in \emph{fine-grained perception} tasks due to their ability to retain detailed visual information. In contrast, feature-compressing connectors, while less effective in fine-grained perception tasks, offer significant speed advantages and perform comparably in \emph{coarse-grained perception} and \emph{reasoning} tasks. These insights are crucial for guiding MLLM architecture design and optimization.
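To make the distinction between the two connector types concrete, the following is a minimal sketch, not the authors' implementation. It assumes a linear projection as the feature-preserving connector and non-overlapping average pooling followed by the same projection as the feature-compressing connector. The function names, the toy dimensions, and the use of P for the number of input patch tokens and Q for the number of output tokens are illustrative assumptions.

```python
import random


def linear_connector(patches, weight):
    """Feature-preserving sketch: project every patch token, keeping all P tokens.

    patches: list of P tokens, each a list of vision_dim floats.
    weight:  vision_dim x llm_dim matrix as a list of rows.
    """
    columns = list(zip(*weight))  # llm_dim columns, each of length vision_dim
    return [
        [sum(x * w for x, w in zip(token, col)) for col in columns]
        for token in patches
    ]


def average_pool_connector(patches, grid, stride):
    """Feature-compressing sketch: average non-overlapping stride x stride windows
    of a grid x grid patch map, reducing P = grid**2 tokens to Q = (grid // stride)**2.
    """
    if grid * grid != len(patches) or grid % stride != 0:
        raise ValueError("patches must form a grid x grid map divisible by stride")
    dim = len(patches[0])
    pooled = []
    for row in range(0, grid, stride):
        for col in range(0, grid, stride):
            window = [
                patches[(row + dr) * grid + (col + dc)]
                for dr in range(stride)
                for dc in range(stride)
            ]
            pooled.append([sum(tok[d] for tok in window) / len(window) for d in range(dim)])
    return pooled


# Toy sizes chosen for illustration only (hypothetical, not from the paper).
random.seed(0)
grid, vision_dim, llm_dim = 4, 8, 6
patches = [[random.gauss(0, 1) for _ in range(vision_dim)] for _ in range(grid * grid)]
weight = [[random.gauss(0, 1) for _ in range(llm_dim)] for _ in range(vision_dim)]

preserved = linear_connector(patches, weight)  # P = 16 tokens reach the LLM
compressed = linear_connector(average_pool_connector(patches, grid, 2), weight)  # Q = 4 tokens
print(len(preserved), len(compressed))  # prints "16 4"
```

In this sketch the language model would receive four times fewer visual tokens from the compressing path, which is where the speed advantage comes from, while each pooled token averages away the differences between neighboring patches that fine-grained perception may depend on.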
