CoMP: Continual Multimodal Pre-training for Vision Foundation Models

Yitong Chen; Lingchen Meng; Wujian Peng; Zuxuan Wu; Yu-Gang Jiang

TL;DR

CoMP addresses two problems with Vision Foundation Models (VFMs): processing images at their native resolution, and narrowing the modality gap between visual features and language models. It introduces C-RoPE, a Continual Rotary Position Embedding that supports arbitrary resolutions, and an Alignment Loss that aligns VFM representations with the LLM's language space using language prototypes and Sinkhorn-Knopp normalization. Training proceeds in three stages that gradually adapt the VFM and align it with language, yielding strong gains on multimodal benchmarks while preserving performance on classification and segmentation tasks. The reported results are state of the art across the 1B and 7B model regimes, with notable gains on tasks such as ChartQA and ADE20K; for example, CoMP-AIMv2 scores 64.9 on ChartQA with a 0.5B LLM and reaches 87.3% accuracy on ImageNet-1K. Overall, CoMP shows that continual multimodal pre-training with native-resolution adaptation and cross-modal alignment can improve vision-language understanding without sacrificing traditional vision capabilities.
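To make the Sinkhorn-Knopp step mentioned above more concrete, here is a minimal sketch of the generic normalization in plain Python. It is not the authors' implementation: the function name `sinkhorn_knopp`, the temperature `epsilon`, the iteration count, and the toy similarity scores are all assumptions chosen for illustration. The idea is to turn a matrix of feature-to-prototype similarities into soft assignments where every row is a probability distribution and the prototypes are used roughly equally.

```python
import math


def sinkhorn_knopp(scores, epsilon=0.05, n_iters=3):
    """Generic Sinkhorn-Knopp normalization (illustrative sketch).

    scores: list of N rows, each holding K similarity scores between one
    visual feature and K prototypes (hypothetical inputs).
    Returns an N x K matrix of soft assignments whose rows sum to 1.
    """
    n, k = len(scores), len(scores[0])
    # Subtract the global maximum before exponentiating for numerical stability.
    top = max(max(row) for row in scores)
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]

    for _ in range(n_iters):
        # Column step: every prototype receives a total mass of 1/K.
        col_sums = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / (col_sums[j] * k) for j in range(k)] for i in range(n)]
        # Row step: every feature receives a total mass of 1/N.
        row_sums = [sum(row) for row in q]
        q = [[v / (rs * n) for v in row] for row, rs in zip(q, row_sums)]

    # Rescale so that each row is a proper probability distribution.
    return [[v * n for v in row] for row in q]


# Toy example: 4 features scored against 3 prototypes (made-up numbers).
toy_scores = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.7, 0.3],
    [0.0, 0.2, 0.9],
]
assignments = sinkhorn_knopp(toy_scores)
```

Because the last step inside the loop normalizes rows, each row of `assignments` sums to 1 exactly (up to floating-point error), while the column totals are only approximately balanced after a few iterations.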

Abstract

Pre-trained Vision Foundation Models (VFMs) provide strong visual representations for a wide range of applications. In this paper, we continually pre-train prevailing VFMs in a multimodal manner such that they can effortlessly process visual inputs of varying sizes and produce visual representations that are more aligned with language representations, regardless of their original pre-training process. To this end, we introduce CoMP, a carefully designed multimodal pre-training pipeline. CoMP uses a Continual Rotary Position Embedding to accommodate visual inputs with different resolutions, and an Alignment Loss between visual and textual features for better cross-modal alignment. After continual pre-training, leading VFMs like DINOv2, SigLIP and AIMv2 achieve remarkable improvements not only in multimodal understanding tasks but also in generic classification and segmentation tasks. Remarkably, CoMP-AIMv2 achieves scores of 64.9 on ChartQA with a 0.5B LLM, while maintaining an 87.3% accuracy on ImageNet-1K and a 51.8 mIoU on ADE20K under frozen chunk evaluation.
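As a rough illustration of why a rotary position embedding can handle inputs of different resolutions, the sketch below applies a generic 2D rotary embedding to a single patch feature. It is not C-RoPE as defined in the paper: the function name `rope_2d`, the split of the feature into a row half and a column half, and the frequency base are assumptions. The point is that the rotation is computed from the patch coordinates instead of being looked up in a fixed-size table, so any grid size works, and dot products between rotated features depend only on the relative offset between patches.

```python
import math


def rope_2d(vec, row, col, base=10000.0):
    """Generic 2D rotary position embedding (illustrative sketch).

    The first half of `vec` is rotated by angles derived from `row`,
    the second half by angles derived from `col`.
    """
    d = len(vec)
    assert d % 4 == 0, "feature dimension must be divisible by 4"
    half = d // 2
    n_pairs = half // 2
    out = []
    for offset, pos in ((0, row), (half, col)):
        for i in range(n_pairs):
            # Each pair of channels gets its own rotation frequency.
            theta = pos * base ** (-i / n_pairs)
            x, y = vec[offset + 2 * i], vec[offset + 2 * i + 1]
            out.append(x * math.cos(theta) - y * math.sin(theta))
            out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Made-up query and key features for two patches.
query = [0.3, -1.2, 0.7, 0.5, 1.1, -0.4, 0.2, 0.9]
key = [-0.6, 0.8, 1.5, -0.3, 0.4, 0.1, -1.0, 0.7]

# Same relative offset between the patches, at two different absolute positions.
near_origin = dot(rope_2d(query, 2, 3), rope_2d(key, 5, 1))
shifted = dot(rope_2d(query, 40, 63), rope_2d(key, 43, 61))
```

In this sketch `near_origin` and `shifted` agree up to floating-point error, since both pairs of patches are separated by the same (3, -2) offset.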
