
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

Ankan Deria, Komal Kumar, Xilin He, Imran Razzak, Hisham Cholakkal, Fahad Shahbaz Khan, Salman Khan

Abstract

Recent vision-language models (VLMs) typically rely on a single vision encoder trained with contrastive image-text objectives, such as CLIP-style pretraining. While contrastive encoders are effective for cross-modal alignment and retrieval, self-supervised visual encoders often capture richer dense semantics and exhibit stronger robustness on recognition and understanding tasks. In this work, we investigate how to scale the fusion of these complementary visual representations for vision-language modeling. We propose CoME-VL (Complementary Multi-Encoder Vision-Language), a modular fusion framework that integrates a contrastively trained vision encoder with a self-supervised DINO encoder. Our approach performs representation-level fusion by (i) entropy-guided multi-layer aggregation with orthogonality-constrained projections to reduce redundancy, and (ii) RoPE-enhanced cross-attention to align heterogeneous token grids and produce compact fused visual tokens. The fused tokens can be injected into a decoder-only LLM with minimal changes to standard VLM pipelines. Extensive experiments across diverse vision-language benchmarks demonstrate that CoME-VL consistently outperforms single-encoder baselines. In particular, we observe an average improvement of 4.9% on visual understanding tasks and 5.4% on grounding tasks. Our method achieves state-of-the-art detection performance on RefCOCO while improving over the baseline by a large margin. Finally, we conduct ablation studies on layer merging, non-redundant feature mixing, and fusion capacity to evaluate how complementary contrastive and self-supervised signals affect VLM performance.
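To make the two fusion steps above concrete, the following is a minimal PyTorch sketch, not the authors' released code: module names, tensor shapes, the entropy-based weighting, and the simplified 1D RoPE are illustrative assumptions. It aggregates multi-layer features from each encoder with entropy-derived weights, applies orthogonality-regularized projections to reduce redundancy between the two streams, and compresses the concatenated tokens with RoPE-enhanced cross-attention into a small set of fused visual tokens.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def entropy_weights(layer_feats):
    """Softmax weights over layers, favouring layers with peaked (low-entropy) token-norm maps."""
    ents = []
    for h in layer_feats:                                  # h: [B, N, D]
        p = F.softmax(h.norm(dim=-1), dim=-1)              # [B, N] spatial distribution
        ents.append(-(p * (p + 1e-8).log()).sum(dim=-1).mean())
    return F.softmax(-torch.stack(ents), dim=0)            # [L]


class OrthoFusion(nn.Module):
    """Entropy-weighted layer aggregation with orthogonality-regularized projections."""

    def __init__(self, d_siglip, d_dino, d_model):
        super().__init__()
        self.proj_s = nn.Linear(d_siglip, d_model)
        self.proj_d = nn.Linear(d_dino, d_model)

    def forward(self, feats_s, feats_d):
        # feats_*: lists of per-layer hidden states, each [B, N, D] (N may differ per encoder).
        agg_s = sum(w * h for w, h in zip(entropy_weights(feats_s), feats_s))
        agg_d = sum(w * h for w, h in zip(entropy_weights(feats_d), feats_d))
        z_s, z_d = self.proj_s(agg_s), self.proj_d(agg_d)
        # Redundancy penalty: push the pooled representations of the two streams apart.
        ortho = (F.normalize(z_s.mean(1), dim=-1) *
                 F.normalize(z_d.mean(1), dim=-1)).sum(-1).pow(2).mean()
        return torch.cat([z_s, z_d], dim=1), ortho          # [B, Ns + Nd, d_model], scalar loss


class RoPECrossAttention(nn.Module):
    """Learned queries attend over the fused stream; rotary embeddings encode token positions."""

    def __init__(self, d_model, n_queries=64, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    @staticmethod
    def rope(x):
        # Simplified 1D rotary embedding over the token axis (assumes even d_model;
        # the paper's alignment of heterogeneous token grids would be 2D).
        n, d = x.shape[1], x.shape[2]
        half = d // 2
        freq = 1.0 / (10000 ** (torch.arange(half, device=x.device) / half))
        ang = torch.arange(n, device=x.device)[:, None] * freq[None, :]
        cos, sin = ang.cos(), ang.sin()
        x1, x2 = x[..., :half], x[..., half:]
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

    def forward(self, kv):                                  # kv: [B, Ns + Nd, d_model]
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        fused, _ = self.attn(self.rope(q), self.rope(kv), kv)
        return fused                                        # [B, n_queries, d_model] compact tokens
```

In a full pipeline, the compact fused tokens would be projected into the language model's embedding space and prepended to the text tokens, with the orthogonality term added to the training objective at a small weight; the paper's exact layer selection, weighting scheme, and RoPE formulation may differ from this sketch.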

Paper Structure

This paper contains 46 sections, 13 equations, 10 figures, and 5 tables.

Figures (10)

  • Figure 1: CoME-VL uses token entropy analysis to identify complementary layer ranges from multiple vision encoders (SigLIP2 and DINOv3). By composing all SigLIP2 layers (which exhibit high entropy, capturing diverse semantic features) with the low-entropy DINOv3 layers 10–23 (which encode strong spatial features), CoME-VL achieves consistent improvements over the single-encoder Molmo [deitke2025molmo] baseline, averaging +4.9% on visual understanding/generation and +5.4% on grounding tasks. (A rough sketch of this entropy-based layer selection follows the figure list.)
  • Figure 2: Semantic feature analysis. (a) Layer-wise comparison of spatial attention in DINOv3 and SigLIP2. We visualize attention masks from four representative layers per model (early, early-mid, mid-late, final), showing the top 30% attention with contour overlays. (b) Layer-wise attention rollout visualization using Grad-CAM for the selected layers from Fig. 1. (Best viewed zoomed in.)
  • Figure 3: Overview of the proposed multi-encoder, multi-scale vision-language framework. Images are processed by two complementary vision encoders (SigLIP2 and DINOv3), with hierarchical features extracted across layers, fused via orthogonality-regularized mixing (OL), spatially aligned using RoPE-based cross-attention, and injected into a decoder-only language model for grounding and generation.
  • Figure 4: Qualitative results on PixMo pointing. Compared to prior VLMs, CoME-VL demonstrates more precise coordinate-level grounding for fine-grained visual queries.
  • Figure 5: Qualitative examples of CoME-VL on chart understanding, document/table reasoning, localization, pointing, and counting, demonstrating its ability to jointly support visual understanding and grounding.
  • ...and 5 more figures
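The layer-range selection described in Figure 1 can be illustrated with a small, assumed procedure (again not the authors' code): score each encoder layer by the entropy of its spatial token-norm distribution and keep the longest contiguous run of layers below a threshold, which for DINOv3 would correspond to a mid-to-late range such as layers 10–23. The scoring function and threshold are placeholders.

```python
import torch
import torch.nn.functional as F


def layer_entropy(hidden_states):
    """Per-layer entropy of the spatial token-norm distribution.

    hidden_states: list of [B, N, D] tensors, one per encoder layer.
    Returns a [L] tensor of mean entropies (low = spatially peaked features).
    """
    scores = []
    for h in hidden_states:
        p = F.softmax(h.norm(dim=-1), dim=-1)
        scores.append(-(p * (p + 1e-8).log()).sum(-1).mean())
    return torch.stack(scores)


def low_entropy_range(entropies, threshold):
    """Longest contiguous run of layer indices whose entropy falls below `threshold`."""
    below = (entropies < threshold).tolist()
    best_start, best_len, start, run = 0, 0, 0, 0
    for i, flag in enumerate(below + [False]):              # sentinel closes a trailing run
        if flag:
            if run == 0:
                start = i
            run += 1
        else:
            if run > best_len:
                best_start, best_len = start, run
            run = 0
    return (best_start, best_start + best_len - 1) if best_len else None


# Hypothetical usage, assuming per-layer hidden states have already been extracted:
# ent = layer_entropy(dino_hidden_states)
# keep = low_entropy_range(ent, threshold=ent.median())     # e.g. could yield (10, 23)
```

Under this reading of Figure 1, the uniformly high-entropy SigLIP2 layers would all be retained, while only the selected DINOv3 range contributes the spatially peaked features to the fusion stage.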