UNIV: Unified Foundation Model for Infrared and Visible Modalities
Fangyuan Mao, Shuo Wang, Jilin Mei, Shun Lu, Chen Min, Fuyang Liu, Xiaokun Feng, Meiqi Wu, Yu Hu
TL;DR
RGB foundation models degrade substantially when applied to infrared data. This work attributes the degradation to pattern shortcuts, a modal bias toward superficial sensor patterns over underlying semantics. It introduces UNIV, a unified RGB-infrared foundation model built on Patch Cross-modal Contrastive Learning (PCCL). PCCL uses semantic anchors derived from a frozen RGB encoder and a contrastive objective that attracts semantically related infrared-visible patch pairs while repelling unrelated ones, aligning both modalities in a single feature space. The authors also provide the Multi-scene Visible-Infrared Pair (MVIP) dataset, with 98,992 aligned visible-infrared image pairs across diverse scenes. Empirical results show state-of-the-art infrared performance (+1.7 mIoU on semantic segmentation and +0.7 mAP on object detection) while maintaining competitive accuracy on RGB tasks, which supports the practicality of a unified cross-modal representation for perception under varied weather and illumination.
Abstract
Joint RGB-infrared perception is essential for achieving robustness under diverse weather and illumination conditions. Although foundation models excel within single modalities, they suffer from substantial cross-modal degradation, an issue we attribute to a pattern shortcut, i.e., a modal bias that prioritizes superficial sensor patterns over underlying semantics. To address this problem, we introduce UNIV, a Unified foundation model for Infrared and Visible modalities. At the core of UNIV lies Patch Cross-modal Contrastive Learning (PCCL), a self-supervised contrastive learning strategy that constructs a unified cross-modal feature space. PCCL employs a frozen pre-trained model to sample pseudo patch pairs based on semantic similarity, and aligns infrared-visible representations by attracting semantically related pairs while repelling unrelated ones. This process simultaneously enhances cross-modal alignment and inter-class semantic separability, guiding the model to focus on semantic structure rather than falling into pattern shortcuts. To further enable cross-modal learning, we introduce MVIP, the most comprehensive visible-infrared benchmark to date, containing 98,992 precisely aligned image pairs across diverse scenes. Extensive experiments demonstrate UNIV's superior performance on infrared tasks (+1.7 mIoU for semantic segmentation and +0.7 mAP for detection), while maintaining competitive accuracy on RGB tasks.
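To make the PCCL description above more concrete, here is a minimal sketch of a patch-level contrastive objective of the kind the abstract outlines. It is not the authors' implementation. The abstract does not specify how pseudo pairs are sampled or what form the loss takes, so the cosine-similarity threshold, the temperature, the multi-positive InfoNCE-style loss, and the function names `sample_pseudo_pairs` and `patch_contrastive_loss` are all assumptions made for illustration. The sketch uses plain Python lists and toy 2-D features so it runs without dependencies; a real model would use a tensor library and learned patch embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sample_pseudo_pairs(anchor_feats, threshold=0.9):
    """Mark patch pairs (i, j) as pseudo-positive when the frozen model's
    features for patches i and j are semantically similar.
    The threshold rule is an assumption, not taken from the paper."""
    n = len(anchor_feats)
    return [
        [cosine(anchor_feats[i], anchor_feats[j]) >= threshold for j in range(n)]
        for i in range(n)
    ]


def patch_contrastive_loss(ir_feats, rgb_feats, positives, temperature=0.1):
    """Multi-positive InfoNCE-style loss (assumed form): each infrared patch
    is attracted to its pseudo-positive RGB patches and repelled from the rest."""
    n = len(ir_feats)
    total = 0.0
    for i in range(n):
        logits = [cosine(ir_feats[i], rgb_feats[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all RGB patches.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        log_probs = [logits[j] - log_denom for j in range(n) if positives[i][j]]
        total += -sum(log_probs) / len(log_probs)
    return total / n


# Toy example: four patches, two semantic groups (hypothetical features).
frozen_rgb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
pairs = sample_pseudo_pairs(frozen_rgb, threshold=0.9)

aligned_ir = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
swapped_ir = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1]]

print("loss, aligned IR features:", patch_contrastive_loss(aligned_ir, frozen_rgb, pairs))
print("loss, swapped IR features:", patch_contrastive_loss(swapped_ir, frozen_rgb, pairs))
```

In this toy setup the infrared features that follow the same semantic grouping as the frozen RGB features receive the lower loss, which is the behavior the abstract describes: alignment is driven by semantic relatedness between patches rather than by modality-specific appearance. Each patch is always a positive for itself here, so every row has at least one positive as long as the threshold does not exceed 1.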
