Vision Verification Enhanced Fusion of VLMs for Efficient Visual Reasoning

Selim Furkan Tekin, Yichang Xu, Gaowen Liu, Ramana Rao Kompella, Margaret L. Loper, Ling Liu

Abstract

With the growing number and diversity of Vision-Language Models (VLMs), many works explore language-based ensemble, collaboration, and routing techniques across multiple VLMs to improve multi-model reasoning. In contrast, we address diverse model selection using both the vision and language modalities. We introduce focal error diversity to capture complementary reasoning across VLMs and a CKA-based focal diversity metric (CKA-focal) to measure disagreement in their visual embeddings. On the ensemble surface constructed from a pool of candidate VLMs, we apply a Genetic Algorithm to prune the component VLMs that do not add value to the fusion performance. We identify the best combination for each task, fuse the outputs of the VLMs in the model pool, and show that heterogeneous models can capture epistemic uncertainty dynamically and mitigate hallucinations. Our V3Fusion approach produces dual focal-diversity fused predictions with high performance for vision-language reasoning, even when there is no majority consensus or the majority of VLMs make incorrect predictions. Extensive experiments validate V3Fusion on four popular VLM benchmarks (A-OKVQA, MMMU, MMMU-Pro, and OCR-VQA). The results show that V3Fusion outperforms the best-performing VLM in accuracy by 8.09% on MMMU and by 4.87% on MMMU-Pro. For generative tasks, V3Fusion outperforms Intern-VL2-8b and Qwen2.5-VL-7b, the top-2 VLM performers on both A-OKVQA and OCR-VQA. Our code and datasets are available at https://github.com/sftekin/v3fusion.
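
The abstract names a CKA-based metric for comparing the visual embeddings of different VLMs but does not define it here. As background only, the following is a minimal sketch of standard linear Centered Kernel Alignment between two embedding matrices computed on the same inputs. It is not the paper's CKA-focal metric; the function name `linear_cka` and the random stand-in embeddings are assumptions made for illustration.

```python
import numpy as np


def linear_cka(x, y):
    """Standard linear CKA between two embedding matrices.

    x: array of shape (n_samples, d1), embeddings from one model.
    y: array of shape (n_samples, d2), embeddings from another model
       for the same n_samples inputs (d1 and d2 may differ).
    Returns a similarity in [0, 1]; 1 means the two representations
    match up to rotation and isotropic scaling.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must have the same number of samples")

    # Center each feature column so the Gram matrices are centered.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


if __name__ == "__main__":
    # Hypothetical stand-ins for two vision encoders' embeddings.
    rng = np.random.default_rng(0)
    emb_a = rng.normal(size=(100, 16))
    emb_b = rng.normal(size=(100, 32))
    print("CKA(a, a):", linear_cka(emb_a, emb_a))
    print("CKA(a, b):", linear_cka(emb_a, emb_b))
```

A low CKA value between two models' visual embeddings indicates that they represent the same images differently, which is the kind of disagreement a diversity metric built on CKA could draw on. How the paper combines such scores into CKA-focal is not specified in this section.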

Paper Structure (20 sections, 8 equations, 8 figures, 9 tables, 1 algorithm)

Figures (8)

  • Figure 1: Performance of the most popular open-source VLMs on various tasks and their error correlation on MMMU.
  • Figure 2: The V3Fusion Framework: An overview.
  • Figure 3: (a) All candidate ensemble teams from the model pool, plotted with their Focal Diversity, Focal-CKA, and Plurality Voting metrics for the MMMU dataset. (b) Scalability of the analysis stage with breakdown.
  • Figure 4: Four examples illustrating the superior performance of V3Fusion compared to existing popular methods. They show cases where V3Fusion reaches the correct output even when the base models fail to reach a consensus or agree on an incorrect option.
  • Figure 5: We illustrate the threshold selection process and the resulting boost from Vision Verification. The left plot shows the average accuracy of the verified predictions for each dataset compared to the non-verified predictions. The second and third plots show the single- and two-component Gaussian fits to the Epistemic Uncertainty; Algorithm 1 selects the threshold as $\tau=0.1315$.
  • ...and 3 more figures