CoMP: Continual Multimodal Pre-training for Vision Foundation Models

Yitong Chen, Lingchen Meng, Wujian Peng, Zuxuan Wu, Yu-Gang Jiang

TL;DR

CoMP tackles two problems: letting Vision Foundation Models (VFMs) process images at their native resolution, and narrowing the modality gap between VFM features and language models. It introduces C-RoPE, a Continual Rotary Position Embedding that supports arbitrary resolutions, and an Alignment Loss that maps VFM representations into the LLM's language space using language prototypes and Sinkhorn-Knopp normalization. Training proceeds in three stages that gradually adapt the VFM and align it with language, yielding strong gains on multimodal benchmarks while preserving performance on classification and segmentation. The reported results are state-of-the-art in both the 1B and 7B model regimes; for example, CoMP-AIMv2 scores 64.9 on ChartQA with a 0.5B LLM while reaching 87.3% accuracy on ImageNet-1K and 51.8 mIoU on ADE20K. Overall, CoMP shows that continual multimodal pre-training with native-resolution adaptation and cross-modal alignment can improve vision-language understanding without sacrificing traditional vision capabilities.

Abstract

Pre-trained Vision Foundation Models (VFMs) provide strong visual representations for a wide range of applications. In this paper, we continually pre-train prevailing VFMs in a multimodal manner such that they can effortlessly process visual inputs of varying sizes and produce visual representations that are more aligned with language representations, regardless of their original pre-training process. To this end, we introduce CoMP, a carefully designed multimodal pre-training pipeline. CoMP uses a Continual Rotary Position Embedding to accommodate visual inputs with different resolutions, and an Alignment Loss between visual and textual features for better cross-modal alignment. After continual pre-training, leading VFMs like DINOv2, SigLIP and AIMv2 achieve remarkable improvements not only in multimodal understanding tasks but also in generic classification and segmentation tasks. Remarkably, CoMP-AIMv2 achieves scores of 64.9 on ChartQA with a 0.5B LLM, while maintaining an 87.3% accuracy on ImageNet-1K and a 51.8 mIoU on ADE20K under frozen chunk evaluation.
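One reason rotary position embeddings suit inputs of different resolutions is that the attention score between a rotated query and key depends only on their relative offset, not on absolute position. The sketch below illustrates that property with a generic axial RoPE-2D in plain Python. It is not the paper's C-RoPE: the function names, the half/half split of the feature vector between row and column, and the base of 10000 are all assumptions made for illustration, and the learned absolute position embedding that C-RoPE also uses is not shown.

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `pos`."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # one frequency per pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def rope_2d(vec, row, col, base=10000.0):
    """Assumed axial layout: first half encodes the row, second half the column.

    len(vec) must be divisible by 4 so that each half splits into pairs.
    """
    half = len(vec) // 2
    return rope_1d(vec[:half], row, base) + rope_1d(vec[half:], col, base)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query/key vectors for two patches.
q = [0.3, -1.2, 0.5, 0.8, 1.0, 0.1, -0.4, 0.7]
k = [0.9, 0.2, -0.6, 0.4, -0.3, 1.1, 0.5, -0.8]

# Same relative offset (+2 rows, +3 columns) at two different absolute positions.
score_near = dot(rope_2d(q, 1, 2), rope_2d(k, 3, 5))
score_far = dot(rope_2d(q, 11, 22), rope_2d(k, 13, 25))
```

The two scores agree up to floating-point error, so patch positions beyond those seen at the original training resolution still produce attention scores that depend only on patch-to-patch offsets.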

Paper Structure

This paper contains 17 sections, 9 equations, 3 figures, 14 tables.

Figures (3)

  • Figure 1: Overview of CoMP. Our method accepts an image at native resolution and its corresponding text. Then, in addition to training through text decoding in next-token prediction paradigm, we also explicitly project the visual features into the language space of LLM using Alignment Loss.
  • Figure 2: Left: C-RoPE. For ease of visualization, the projection layers $\mathcal{P}roj_{q,k,v,o}$ and scale operators are omitted. We leverage both absolute learned position embedding and relative RoPE-2D to capture richer positional information. Right: Alignment Loss. We illustrate it in the case of one single pair of global vision and text features $\mathbf{F}_v$ and $\mathbf{F}_t$ for simplicity. $\mathbf{F}_v$ and $\mathbf{F}_t$ are mapped by frozen learned prototype $\mathbf{W}$, i.e., the word embedding of LLMs. Then, they are converted into normalized probabilities using the Softmax function and the iterative Sinkhorn-Knopp algorithm, respectively. Finally, cross-entropy is applied as the loss. To prevent information leakage, the text features are extracted without image prefixes.
  • Figure 3: Varying the image resolution during inference. We investigate the impact of image resolution on DocVQA and ChartQA by our CoMP-MM-1B.
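
The Alignment Loss outlined in the Figure 2 caption can be sketched in a few lines of plain Python: project vision and text features onto frozen prototypes, apply Softmax on the vision side and Sinkhorn-Knopp on the text side, then take the cross-entropy. This is a minimal sketch and not the authors' implementation. The function names, the temperature and epsilon values, the number of Sinkhorn iterations, the batch-level normalization scheme, and the toy 2-dimensional features are all assumptions.

```python
import math

def project(feature, prototypes):
    """Dot product of one feature with every prototype row (the role of W)."""
    return [sum(w * f for w, f in zip(row, feature)) for row in prototypes]

def softmax(logits, temperature=1.0):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((x - peak) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sinkhorn_knopp(logits_batch, epsilon=0.5, n_iters=3):
    """Alternately normalize prototype columns and sample rows of exp(logits / epsilon).

    Returns one distribution over prototypes per sample (each row sums to 1).
    """
    n_samples, n_protos = len(logits_batch), len(logits_batch[0])
    peak = max(max(row) for row in logits_batch)
    q = [[math.exp((x - peak) / epsilon) for x in row] for row in logits_batch]
    for _ in range(n_iters):
        for k in range(n_protos):  # every prototype receives total mass 1 / n_protos
            col_sum = sum(q[b][k] for b in range(n_samples))
            for b in range(n_samples):
                q[b][k] /= col_sum * n_protos
        for b in range(n_samples):  # every sample receives total mass 1 / n_samples
            row_sum = sum(q[b])
            q[b] = [x / (row_sum * n_samples) for x in q[b]]
    return [[x * n_samples for x in row] for row in q]

def alignment_loss(vision_feats, text_feats, prototypes, temperature=1.0, epsilon=0.5):
    """Mean cross-entropy between Sinkhorn text targets and Softmax vision predictions."""
    targets = sinkhorn_knopp([project(f, prototypes) for f in text_feats], epsilon)
    losses = []
    for f_v, target in zip(vision_feats, targets):
        probs = softmax(project(f_v, prototypes), temperature)
        losses.append(-sum(t * math.log(max(p, 1e-12)) for t, p in zip(target, probs)))
    return sum(losses) / len(losses)

# Toy data: 3 hypothetical prototypes (standing in for word embeddings), 2 image-text pairs.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
F_v = [[0.9, 0.1], [0.2, 0.8]]
F_t = [[1.0, 0.0], [0.0, 1.0]]
loss = alignment_loss(F_v, F_t, W)
```

In this sketch the prototypes `W` are only read, never updated, mirroring the caption's statement that they are frozen. In a real setting `W` would have one row per vocabulary entry, and the text features would be extracted without image prefixes, as the caption notes.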