MOVE: A Mixture-of-Vision-Encoders Approach for Domain-Focused Vision-Language Processing

Matvey Skripkin, Elizaveta Goncharova, Dmitrii Tarasov, Andrey Kuznetsov

TL;DR

MOVE introduces a lightweight router-driven mixture-of-vision-encoders for domain-focused vision-language processing, routing each input to a single, domain-tuned encoder (InternViT, Texify, UniChart) to avoid context overload and expensive slicing. The architecture integrates an adapter per encoder and a routing mechanism trained on embeddings from a general encoder, feeding into a fixed LLM (Qwen2/Qwen2.5) for multimodal reasoning. Empirical results show MOVE achieves competitive performance across ChartQA, SQA, MMMU, MMBench, and related benchmarks with as few as 196–576 visual tokens, and improves over comparable approaches on several tasks while highlighting gaps in OCR-heavy domains. The work emphasizes efficiency and versatility, and suggests future expansion to include additional encoders (e.g., document OCR and medical imaging) and end-to-end training of the router and adapters, broadening domain coverage and applicability.

Abstract

Multimodal language models (MLMs) integrate visual and textual information by coupling a vision encoder with a large language model through a dedicated adapter. While existing approaches commonly rely on a single pre-trained vision encoder, a wide variety of specialized encoders exist that can boost a model's performance in distinct domains. In this work, we propose MOVE (Mixture of Vision Encoders), a simple yet effective approach that leverages multiple pre-trained encoders for specialized multimodal tasks. MOVE automatically routes inputs to the most appropriate encoder among candidates such as UniChart, InternViT, and Texify, thereby enhancing performance across a diverse set of benchmarks, including ChartQA, MMBench, and MMMU. Experimental results demonstrate that MOVE achieves competitive accuracy without incurring the complexities of image slicing for high-resolution images.
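To make the routing idea concrete, here is a minimal, dependency-free sketch of top-1 routing: a router scores a general-encoder embedding, exactly one domain encoder is run, and that encoder's adapter projects its tokens to the LLM width. Everything beyond the three encoder names is an assumption for illustration: the linear-plus-softmax router, the stub encoders, the toy token counts and dimensions, and the helper names (`Top1Router`, `stub_encoder`, `move_visual_tokens`) are hypothetical and not taken from the paper.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Matrix = List[Vector]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(weights: Matrix, x: Vector) -> Vector:
    """Multiply a (rows x len(x)) matrix by the vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class Top1Router:
    """Hypothetical linear router over a general encoder's pooled embedding."""

    def __init__(self, weights: Matrix, names: List[str]) -> None:
        self.weights = weights  # one row of weights per candidate encoder
        self.names = names

    def route(self, general_embedding: Vector) -> str:
        probs = softmax(matvec(self.weights, general_embedding))
        best = max(range(len(probs)), key=lambda i: probs[i])
        return self.names[best]  # top-1: exactly one encoder is selected


def stub_encoder(num_tokens: int, dim: int) -> Callable[[object], Matrix]:
    """Stand-in for a real vision encoder; returns constant toy tokens."""
    def encode(image: object) -> Matrix:
        return [[1.0] * dim for _ in range(num_tokens)]
    return encode


def move_visual_tokens(
    image: object,
    general_embedding: Vector,
    router: Top1Router,
    encoders: Dict[str, Callable[[object], Matrix]],
    adapters: Dict[str, Matrix],
) -> Tuple[str, Matrix]:
    """Route, encode with the single chosen encoder, then apply its adapter."""
    name = router.route(general_embedding)
    tokens = encoders[name](image)
    projected = [matvec(adapters[name], token) for token in tokens]
    return name, projected


NAMES = ["InternViT", "Texify", "UniChart"]
# Toy identity-like router weights; a real router would be trained.
ROUTER = Top1Router([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], NAMES)
# Token counts and widths are arbitrary toy values, not the models' real ones.
ENCODERS = {
    "InternViT": stub_encoder(4, 2),
    "Texify": stub_encoder(3, 2),
    "UniChart": stub_encoder(5, 2),
}
LLM_DIM = 3
# One adapter per encoder: a linear map from encoder width (2) to LLM_DIM.
ADAPTERS = {name: [[0.5, 0.5] for _ in range(LLM_DIM)] for name in NAMES}

chosen, tokens = move_visual_tokens(None, [0.1, 0.2, 0.9], ROUTER, ENCODERS, ADAPTERS)
print(chosen, len(tokens), len(tokens[0]))
```

Because only the selected encoder runs, the number of visual tokens passed to the LLM is that of a single encoder rather than the concatenation of all of them, which is the efficiency argument the summary makes.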

Paper Structure

This paper contains 21 sections, 5 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: The MOVE architecture. The model consists of a large language model (LLM), multiple vision experts, a router for encoder selection, and adapter modules bridging visual and textual representations
  • Figure 2: Router pre-training stage
  • Figure 3: Pre-training stage of MOVE
  • Figure 4: Supervised fine-tuning stage of MOVE
  • Figure 5: Example highlighting MOVE's consistency with the ground truth in LaTeX code generation, compared to LLaVA-OneVision 7B
  • ...and 1 more figure