LEO: Boosting Mixture of Vision Encoders for Multimodal Large Language Models

Mozhgan Nasr Azadani, James Riddell, Sean Sedwards, Krzysztof Czarnecki

TL;DR

LEO tackles the challenge of integrating multiple vision encoders within multimodal LLMs by introducing a dual-branch vision-encoder framework and a tile-level post-adaptation fusion strategy. The model processes high-resolution images via dynamic tiling, reduces the number of visual tokens with pixel unshuffle, and interleaves the per-tile visual tokens from its two encoders before passing them to the LLM. It is trained in two stages. Across 13 benchmarks, LEO outperforms state-of-the-art open-source and hybrid MLLMs on the majority of tasks, including OCR and chart understanding, and it transfers to autonomous driving with competitive performance and no change to the architecture or training recipe. The work supports the practical viability of hybrid vision encoders in MLLMs and offers a foundation for future domain-specific adaptations and scalable fusion strategies. The code and models are to be released publicly.
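To make the token-reduction step concrete, the following is a minimal sketch of pixel unshuffle on a grid of visual tokens. It is not the authors' implementation: the function name `pixel_unshuffle`, the downscale factor `r = 2`, and the ordering of the merged channels are assumptions chosen for illustration. Each `r x r` block of neighbouring tokens is merged into one token with `r * r` times as many channels, so the token count drops by a factor of `r * r`.

```python
def pixel_unshuffle(grid, r=2):
    """Merge each r x r block of tokens into a single, wider token.

    grid: H x W nested list, where each cell is a feature vector (list of C floats).
    Returns an (H // r) x (W // r) grid whose cells have C * r * r features.
    The factor r and the channel ordering are illustrative assumptions.
    """
    h, w = len(grid), len(grid[0])
    if h % r or w % r:
        raise ValueError("grid height and width must be divisible by r")
    out = []
    for i in range(0, h, r):
        row = []
        for j in range(0, w, r):
            merged = []
            # Concatenate the features of the r x r neighbourhood, row by row.
            for di in range(r):
                for dj in range(r):
                    merged.extend(grid[i + di][j + dj])
            row.append(merged)
        out.append(row)
    return out


# Toy 4 x 4 grid with one feature per token (values 0..15).
grid = [[[4 * i + j] for j in range(4)] for i in range(4)]
reduced = pixel_unshuffle(grid, r=2)
tokens_before = len(grid) * len(grid[0])
tokens_after = len(reduced) * len(reduced[0])
```

In this toy case the 16 single-feature tokens become 4 tokens of 4 features each, so no information is discarded; it is moved from the sequence dimension into the channel dimension.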

Abstract

Enhanced visual understanding serves as a cornerstone for multimodal large language models (MLLMs). Recent hybrid MLLMs incorporate a mixture of vision experts to address the limitations of using a single vision encoder and excessively long visual tokens. Despite the progress of these MLLMs, a research gap remains in effectively integrating diverse vision encoders. This work explores fusion strategies of visual tokens for hybrid MLLMs, leading to the design of LEO, a novel MLLM with a dual-branch vision encoder framework that incorporates a post-adaptation fusion strategy and adaptive tiling: for each segmented tile of the input images, LEO sequentially interleaves the visual tokens from its two vision encoders. Extensive evaluation across 13 vision-language benchmarks reveals that LEO outperforms state-of-the-art open-source MLLMs and hybrid MLLMs on the majority of tasks. Furthermore, we show that LEO can be adapted to the specialized domain of autonomous driving without altering the model architecture or training recipe, achieving competitive performance compared to existing baselines. The code and model will be publicly available.
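The tile-level interleaving described in the abstract can be sketched as follows. This is one reading of the description, not the released code: the function name `interleave_tiles` and the string placeholders standing in for token embeddings are hypothetical, and the sketch assumes that both encoders' outputs have already been projected to the LLM embedding space (the "post-adaptation" part). For each tile, the tokens from the first encoder are followed by the tokens from the second encoder, and the tiles are then laid out in order.

```python
def interleave_tiles(tokens_a, tokens_b):
    """Build one visual token sequence by alternating encoders tile by tile.

    tokens_a, tokens_b: lists with one entry per tile; each entry is the list of
    (already projected) visual tokens that one encoder produced for that tile.
    Returns: [A tile 1, B tile 1, A tile 2, B tile 2, ...] flattened.
    """
    if len(tokens_a) != len(tokens_b):
        raise ValueError("both encoders must cover the same number of tiles")
    sequence = []
    for tile_a, tile_b in zip(tokens_a, tokens_b):
        sequence.extend(tile_a)  # tokens of this tile from encoder A
        sequence.extend(tile_b)  # tokens of the same tile from encoder B
    return sequence


# Two tiles, two tokens per tile per encoder (placeholder strings, not embeddings).
enc_a = [["A1_0", "A1_1"], ["A2_0", "A2_1"]]
enc_b = [["B1_0", "B1_1"], ["B2_0", "B2_1"]]
visual_sequence = interleave_tiles(enc_a, enc_b)
```

Compared with concatenating all of one encoder's tokens before the other's, this layout keeps the two views of the same tile adjacent in the sequence the LLM receives.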

Paper Structure (17 sections, 5 figures, 7 tables)

This paper contains 17 sections, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Comparison of the performance of LEO across diverse vision-language tasks with recent approaches chen2024internvl, bai2023qwen, liu2024improved, fan2024mousi, luo2024feast.
  • Figure 2: Top: Comparison between the fusion strategy of existing hybrid MLLMs and that of LEO. Bottom: The most common fusion paradigms in the literature: (1) channel concatenation shi2024eagle, (2) sequence concatenation kar2024brave, (3) MR-adapter luo2024feast, and (4) cross-attention li2024mini.
  • Figure 3: The architecture of our model. LEO adopts a dual-vision MLLM architecture with tile-level post-adaptation fusion of visual tokens. Pixel unshuffle is used to reduce the number of visual tokens.
  • Figure 4: Tile Segmentation: Each input image is divided into multiple tiles to capture localized details, while a resized version maintains global context. The tiles are shown after preprocessing with the SAM kirillov2023segment preprocessor.
  • Figure 5: Qualitative results of LEO's enhanced visual understanding on various vision-language tasks. Some images are taken from the following benchmarks: MMVet yu2023mmvet, MMMU yue2024mmmu, TextVQA singh2019towards-textvqa, and LingoQA marcu2312lingoqa.