Investigating Redundancy in Multimodal Large Language Models with Multiple Vision Encoders

Yizhou Wang, Song Mao, Yang Chen, Yufan Shen, Yinqiao Yan, Pinlong Cai, Ding Wang, Guohang Yan, Zhi Yu, Xuming Hu, Botian Shi

TL;DR

This work investigates encoder redundancy in multimodal large language models that use multiple vision encoders. By systematically masking encoders and introducing the Conditional Utilization Rate (CUR) and Information Gap (IG), the authors quantify each encoder's marginal contribution and the imbalance among encoders. They reveal pervasive redundancy: many tasks see little gain from extra encoders, while some encoders dominate specialized tasks such as OCR/Chart; larger models often exhibit greater utility disparity. Importantly, dual-encoder variants can achieve over 90% of full-model performance with substantially lower training and inference costs, challenging the assumption that more encoders are always better. The study provides a practical framework for designing more efficient MLLMs and highlights the trade-offs between accuracy, compute, and architectural complexity.

Abstract

Recent multimodal large language models (MLLMs) increasingly integrate multiple vision encoders to improve performance on various benchmarks, assuming that diverse pretraining objectives yield complementary visual signals. However, we show this assumption often fails in practice. Through systematic encoder masking across representative multi-encoder MLLMs, we find that performance typically degrades gracefully, and sometimes even improves, when selected encoders are masked, revealing pervasive encoder redundancy. To quantify this effect, we introduce two principled metrics: the Conditional Utilization Rate (CUR), which measures an encoder's marginal contribution in the presence of others, and the Information Gap (IG), which captures heterogeneity in encoder utility within a model. Using these tools, we observe (i) strong specialization on tasks like OCR and Chart, where a single encoder can dominate with a CUR greater than 90%, (ii) high redundancy on general VQA and knowledge-based tasks, where encoders are largely interchangeable, and (iii) instances of detrimental encoders with negative CUR. Notably, masking specific encoders can yield up to 16% higher accuracy on a specific task category and a 3.6% overall performance boost compared to the full model. Furthermore, single- and dual-encoder variants recover over 90% of baseline performance on most non-OCR tasks. Our analysis challenges the "more encoders are better" heuristic in MLLMs and provides actionable diagnostics for developing more efficient and effective multimodal architectures.
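The abstract does not give formulas for the two metrics, so the following is only a rough sketch of one plausible reading. It assumes CUR is the relative accuracy drop when a single encoder is masked while the others stay active, and that IG is the spread between the largest and smallest CUR within a model. Both formulas, the function names, and the accuracy numbers are illustrative assumptions, not the paper's definitions or results.

```python
from typing import Dict


def conditional_utilization_rate(acc_full: float, acc_masked: float) -> float:
    """Assumed CUR: relative accuracy drop when one encoder is masked.

    A positive value means the model relies on the encoder; a negative
    value means masking it actually helped (a detrimental encoder).
    """
    if acc_full <= 0:
        raise ValueError("acc_full must be positive")
    return (acc_full - acc_masked) / acc_full


def information_gap(curs: Dict[str, float]) -> float:
    """Assumed IG: spread between the most and least utilized encoder."""
    return max(curs.values()) - min(curs.values())


# Hypothetical accuracies, invented purely for illustration.
acc_full = 0.80
acc_when_masked = {"enc_a": 0.05, "enc_b": 0.78, "enc_c": 0.82}

curs = {
    name: conditional_utilization_rate(acc_full, acc)
    for name, acc in acc_when_masked.items()
}
gap = information_gap(curs)

for name, cur in curs.items():
    print(f"{name}: CUR = {cur:+.3f}")
print(f"IG = {gap:.3f}")
```

Under these made-up numbers, enc_a would be a dominant encoder (CUR above 90%), enc_b nearly redundant, and enc_c slightly detrimental (negative CUR), which mirrors the three regimes the abstract describes.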

Paper Structure

This paper contains 36 sections, 4 equations, 5 figures, 20 tables.

Figures (5)

  • Figure 1: An illustration of encoder redundancy. Different vision encoders provide similar or conflicting visual cues; ablating one or several of them maintains or even improves performance.
  • Figure 2: Performance of multi-encoder MLLMs with different numbers of masked vision encoders. Max, Min, and Mean refer to the best, worst, and average performance, respectively, over all possible subsets of ablated encoders.
  • Figure 3: CUR of different encoders across benchmark categories. A higher CUR means a larger dependence on that encoder.
  • Figure 4: A showcase of how Eagle-X4 8B Plus behaves with and without specific encoders masked.
  • Figure 5: A showcase of how Eagle-X4 8B Plus behaves with and without specific encoders masked.
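
The Max, Min, and Mean aggregation described in the Figure 2 caption can be sketched as follows: for a given number k of masked encoders, score every size-k subset and summarize the results. This is a minimal illustration only; the function name, the encoder names, and the toy scoring function are hypothetical stand-ins for a real benchmark evaluation.

```python
from itertools import combinations
from statistics import mean
from typing import Callable, Dict, FrozenSet, Sequence


def summarize_masking(
    encoders: Sequence[str],
    k: int,
    score_fn: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Score every way of masking k encoders and report max, min, and mean."""
    scores = [score_fn(frozenset(subset)) for subset in combinations(encoders, k)]
    return {"max": max(scores), "min": min(scores), "mean": mean(scores)}


# Toy stand-in for running a benchmark with some encoders masked:
# each masked encoder subtracts a hypothetical penalty from a base score.
PENALTY = {"a": 0.50, "b": 0.02, "c": -0.01, "d": 0.03}


def toy_score(masked: FrozenSet[str]) -> float:
    return 0.80 - sum(PENALTY[name] for name in masked)


summary = summarize_masking(list(PENALTY), k=1, score_fn=toy_score)
print(summary)
```

With four encoders the number of subsets stays small (at most six for k = 2), so exhaustive enumeration is cheap here; the cost of a real study would be dominated by the evaluation runs themselves.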