To Preserve or To Compress: An In-Depth Study of Connector Selection in Multimodal Large Language Models

Junyan Lin, Haoran Chen, Dawei Zhu, Xiaoyu Shen

TL;DR

This study systematically evaluates how connectors between vision encoders and language models affect multimodal LLM performance across coarse-grained perception, fine-grained perception, and reasoning tasks. By classifying connectors as feature-preserving or feature-compressing and reclassifying sub-tasks from MMBench, MME, and SEED-Bench, the authors show that feature-preserving connectors excel in fine-grained perception, while feature-compressing connectors deliver speed advantages with competitive results on the other task types. The work further dissects pooling strategies (average pooling, attention pooling, convolutional mapping) and shows how image resolution and token count mediate these effects, offering concrete recommendations for connector selection under resource constraints. Overall, the findings guide architecture design choices in MLLMs by balancing task demands against computational budgets, and highlight that higher resolutions diminish the relative gains of preserving features while amplifying training costs. The connector design is described in terms of patch counts $P$ and $Q$, which frame the trade-off between information retention and efficiency under varying perceptual granularities.

Abstract

In recent years, multimodal large language models (MLLMs) have garnered significant attention from both industry and academia. However, there is still considerable debate on constructing MLLM architectures, particularly regarding the selection of appropriate connectors for perception tasks of varying granularities. This paper systematically investigates the impact of connectors on MLLM performance. Specifically, we classify connectors into feature-preserving and feature-compressing types. Utilizing a unified classification standard, we categorize sub-tasks from three comprehensive benchmarks, MMBench, MME, and SEED-Bench, into three task types: coarse-grained perception, fine-grained perception, and reasoning, and evaluate the performance. Our findings reveal that feature-preserving connectors excel in fine-grained perception tasks due to their ability to retain detailed visual information. In contrast, feature-compressing connectors, while less effective in fine-grained perception tasks, offer significant speed advantages and perform comparably in coarse-grained perception and reasoning tasks. These insights are crucial for guiding MLLM architecture design and advancing the optimization of MLLM architectures.

Paper Structure

This paper contains 26 sections, 8 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Radar-chart comparison of performance at 224, 336, and 448 resolutions across coarse-grained perception, fine-grained perception, and reasoning tasks on MMBench. Each task type includes four sub-tasks: Image Quality, Image Scene, Image Style, and Image Topic for coarse-grained perception; Action Recognition, Celebrity Recognition, Object Localization, and OCR for fine-grained perception; and Function Reasoning, Identity Reasoning, Social Relation, and Structuralized Image-Text Understanding for reasoning.
  • Figure 2: The structure of different visual-language connectors. The upper part of the figure shows the overall structure of the connectors, while the lower part provides a simplified visualization of the compression step. (a) The Average Pooling-based connector compresses features by averaging visual tokens within local windows. (b) The Attention Pooling-based connector uses cross-attention between learnable queries and visual tokens to abstract the visual tokens into a fixed number of compressed tokens. Each compressed token is derived from all visual tokens with weighted contributions. (c) The Convolutional Mapping-based connector uses convolution operations to enhance local context modeling while reducing the number of tokens. Each compressed token is derived from the visual tokens within a local window with weighted contributions.
  • Figure 3: Examples of conflicting partition criteria for perception granularity in the MME benchmark.
  • Figure 4: Comparison of two-layer MLP and linear connectors on coarse-grained, fine-grained perception, and reasoning tasks at resolutions of 224, 336, and 448.
  • Figure 5: Analysis of the impact of different compressed token numbers on the performance of coarse-grained perception (C), fine-grained perception (F), and reasoning (R) tasks.
  • ...and 1 more figure
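
The three compression strategies described in the Figure 2 caption can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`average_pool`, `attention_pool`, `conv_map`), the square token grid, the non-overlapping windows, the single attention head without learned projections, and the one-scalar-per-position kernel are all simplifying assumptions made for illustration.

```python
import math
import random


def average_pool(tokens, grid, window):
    """Average the tokens inside each non-overlapping window x window block
    of a grid x grid token layout (row-major order)."""
    dim = len(tokens[0])
    out = []
    for r0 in range(0, grid, window):
        for c0 in range(0, grid, window):
            group = [tokens[r * grid + c]
                     for r in range(r0, r0 + window)
                     for c in range(c0, c0 + window)]
            out.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return out


def attention_pool(tokens, queries):
    """Cross-attention with one output token per query. Every output is a
    softmax-weighted sum over ALL visual tokens (no learned projections here)."""
    dim = len(tokens[0])
    out = []
    for q in queries:
        scores = [sum(qi * ti for qi, ti in zip(q, t)) / math.sqrt(dim) for t in tokens]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)])
    return out


def conv_map(tokens, grid, window, kernel):
    """Strided convolution simplified to one scalar weight per window position,
    shared across channels. Each output depends only on its local window."""
    dim = len(tokens[0])
    out = []
    for r0 in range(0, grid, window):
        for c0 in range(0, grid, window):
            acc = [0.0] * dim
            for dr in range(window):
                for dc in range(window):
                    t = tokens[(r0 + dr) * grid + (c0 + dc)]
                    for d in range(dim):
                        acc[d] += kernel[dr][dc] * t[d]
            out.append(acc)
    return out


# Toy setup (all values are illustrative assumptions): a 4x4 grid of 8-dim tokens.
random.seed(0)
grid, dim, window = 4, 8, 2
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(grid * grid)]
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(4)]
kernel = [[0.25, 0.25], [0.25, 0.25]]  # uniform weights reduce conv_map to averaging

pooled = average_pool(tokens, grid, window)
attended = attention_pool(tokens, queries)
mapped = conv_map(tokens, grid, window, kernel)
print(len(tokens), len(pooled), len(attended), len(mapped))  # 16 4 4 4
```

All three functions reduce 16 input tokens to 4. The difference lies in what each output token sees: a fixed local mean, a weighted mixture of every input token, or a weighted mixture of a local window. With a uniform kernel the convolutional sketch coincides with average pooling, which is why learned, non-uniform weights are what distinguish the two in practice.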