Enhancing Perception Capabilities of Multimodal LLMs with Training-Free Fusion

Zhuokun Chen, Jinwu Hu, Zeshuai Deng, Yufeng Wang, Bohan Zhuang, Mingkui Tan

TL;DR

The paper addresses the high cost of enhancing visual perception in multimodal LLMs (MLLMs) by proposing VisionFuse, a training-free framework that combines the vision encoders of multiple MLLMs within a family and aligns them with a single LLM via delta-parameter merging. By concatenating the vision tokens from these encoders and merging the corresponding LLM parameters, VisionFuse improves multimodal reasoning without retraining. Results across several benchmarks show consistent gains, including an average improvement of over 4% when integrating MiniGemini-8B and SLIME-8B, and token pruning offers a practical way to keep inference efficient despite the longer visual sequences. The work rests on three observations: different MLLMs attend to distinct image regions, vision-encoder features are more closely aligned within an MLLM family, and merging LLM parameters is effective for aligning one language model with multiple encoders. Together these point to a scalable, low-overhead way to boost MLLM perception.

Abstract

Multimodal LLMs (MLLMs) equip language models with visual capabilities by aligning vision encoders with language models. Existing methods to enhance the visual perception of MLLMs often involve designing more powerful vision encoders, which requires exploring a vast design space and re-aligning each potential encoder with the language model, resulting in prohibitively high training costs. In this paper, we introduce VisionFuse, a novel integration framework that efficiently utilizes multiple vision encoders from off-the-shelf MLLMs to enhance visual perception without requiring additional training. Our approach is motivated by the observation that different MLLMs tend to focus on distinct regions given the same query and image. Moreover, we find that the feature distributions of vision encoders within an MLLM family, a group of MLLMs sharing the same pretrained LLM, are highly aligned. Building on these insights, VisionFuse enriches the visual context by concatenating the tokens generated by the vision encoders of selected MLLMs within a family. By merging the parameters of language models from these MLLMs, VisionFuse allows a single language model to align with various vision encoders, significantly reducing deployment overhead. We conduct comprehensive evaluations across multiple multimodal benchmarks using various MLLM combinations, demonstrating substantial improvements in multimodal tasks. Notably, when integrating MiniGemini-8B and SLIME-8B, VisionFuse achieves an average performance increase of over 4%.
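As a rough illustration of the token concatenation described in the abstract, the sketch below joins the projected vision tokens from several encoders into one visual context and places it ahead of the text tokens. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `fuse_visual_tokens`, the plain-list token format, and the ordering (vision tokens before text tokens) are hypothetical choices made for illustration.

```python
from typing import List

Token = List[float]  # hypothetical: one embedding vector per token


def fuse_visual_tokens(
    per_encoder_tokens: List[List[Token]],
    text_tokens: List[Token],
    hidden_size: int,
) -> List[Token]:
    """Concatenate vision tokens from several encoders, then append text tokens.

    Assumes every encoder's projector already maps into the same hidden size,
    which is the embedding size of the shared language model.
    """
    fused: List[Token] = []
    for tokens in per_encoder_tokens:
        for tok in tokens:
            if len(tok) != hidden_size:
                raise ValueError("all tokens must share the LLM hidden size")
            fused.append(tok)
    for tok in text_tokens:
        if len(tok) != hidden_size:
            raise ValueError("all tokens must share the LLM hidden size")
    # Assumed ordering: visual context first, text query last.
    return fused + text_tokens


# Toy example: encoder A yields 2 tokens, encoder B yields 1, the text has 1.
encoder_a = [[0.1, 0.2], [0.3, 0.4]]
encoder_b = [[0.5, 0.6]]
text = [[1.0, 1.0]]
sequence = fuse_visual_tokens([encoder_a, encoder_b], text, hidden_size=2)
```

The fused sequence grows with every encoder added, which is why the summary above mentions token pruning as a way to keep the longer visual sequence affordable.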

Paper Structure

This paper contains 20 sections, 5 equations, 17 figures, 8 tables, 1 algorithm.

Figures (17)

  • Figure 1: To enhance the perception capabilities of MLLMs, existing methods incur substantial training costs to explore a vast design space and align each candidate encoder with the language model. In contrast, VisionFuse directly reuses the vision encoders of MLLMs within a family and aligns them with a single LLM by merging the LLM parameters, without additional training overhead. For example, by integrating SLIME-8B and MGM-8B at no training cost, VisionFuse clearly outperforms the individual MLLMs and other leading methods.
  • Figure 2: Different MLLMs exhibit varying visual perception capabilities. Using a single example, we visualize the cross-attention maps averaged over all layers for two MLLMs, MGM and SLM, and for our method, which integrates the two, to show which areas each model focuses on. The attention of VisionFuse is more accurate, combining the perceptual abilities of both MGM and SLM. Here, "MGM" denotes Mini-Gemini [li2024mini] and "SLM" denotes SLIME [zhang2024beyond].
  • Figure 3: Overview of VisionFuse. VisionFuse merges the language model parameters from $M$ different pretrained MLLMs within a family to align a single language model with multiple vision encoders. The merged LLM is obtained by the weighted linear interpolation of the pretrained LLM and $M$ delta parameters, which are the changes in parameters of LLMs during fine-tuning. The input image is processed through distinct preprocessing pipelines, consistent with those in MLLMs, as well as vision encoders and projectors, to extract richer visual features. These features are then concatenated with text tokens and fed into the merged language model.
  • Figure 4: Summary of our exploration and observations: (a) demonstrates that different MLLMs focus on distinct image regions for the same visual and textual inputs; (b) reveals that vision encoders within an MLLM family exhibit more similar feature distributions; and (c) highlights the importance of merging language model parameters to align the language model with different vision encoders.
  • Figure 5: Comparison with directly duplicating the original tokens multiple times.
  • ...and 12 more figures
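
The parameter merging described in the Figure 3 caption can be sketched as follows. Each delta is the change of a fine-tuned LLM relative to the shared pretrained LLM, and the merged model adds a weighted sum of those deltas back onto the pretrained parameters. This is a minimal sketch under stated assumptions: the function name `merge_llm_params`, the dictionary-of-lists parameter format, and the specific weights are hypothetical, and the paper's exact weighting scheme may differ.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # hypothetical flat format: parameter name -> values


def merge_llm_params(
    base: Params,
    finetuned: List[Params],
    weights: List[float],
) -> Params:
    """Return base + sum_i weights[i] * (finetuned[i] - base), per parameter."""
    if len(finetuned) != len(weights):
        raise ValueError("need exactly one weight per fine-tuned model")
    merged: Params = {}
    for name, base_vals in base.items():
        out = list(base_vals)  # start from the pretrained LLM
        for params, w in zip(finetuned, weights):
            for k, (ft, b) in enumerate(zip(params[name], base_vals)):
                out[k] += w * (ft - b)  # add this model's weighted delta
        merged[name] = out
    return merged


# Toy example with M = 2 fine-tuned LLMs that share one pretrained base.
base = {"layer.weight": [1.0, 2.0]}
llm_a = {"layer.weight": [1.5, 2.0]}  # delta = [0.5, 0.0]
llm_b = {"layer.weight": [1.0, 3.0]}  # delta = [0.0, 1.0]
merged = merge_llm_params(base, [llm_a, llm_b], weights=[1.0, 1.0])
```

Setting every weight to zero recovers the pretrained LLM, and a single weight of one recovers the corresponding fine-tuned LLM, which makes the interpolation easy to sanity-check. The sketch relies on all models sharing the same pretrained base, which is what the summary calls an MLLM family.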