UNIV: Unified Foundation Model for Infrared and Visible Modalities

Fangyuan Mao, Shuo Wang, Jilin Mei, Shun Lu, Chen Min, Fuyang Liu, Xiaokun Feng, Meiqi Wu, Yu Hu

TL;DR

This work addresses cross-modal degradation when applying RGB foundation models to infrared data by identifying pattern shortcuts that degrade semantic understanding. It introduces UNIV, a unified RGB–IR foundation model built on Patch Cross-modal Contrastive Learning (PCCL), which uses semantic anchors derived from a frozen RGB encoder and a new cross-modal training objective to align infrared and visible representations in a single space. The authors also provide the Multi-scene Visible-Infrared Pair (MVIP) dataset, enabling robust cross-modal research with 98,992 aligned IR–RGB pairs. Empirical results show state-of-the-art infrared performance in semantic segmentation and object detection (+1.7 mIoU and +0.7 mAP, respectively) while largely preserving RGB downstream accuracy, demonstrating the practical potential of a unified cross-modal representation for robust perception across diverse environments.

Abstract

Joint RGB-infrared perception is essential for achieving robustness under diverse weather and illumination conditions. Although foundation models excel within single modalities, they suffer from substantial cross-modal degradation, an issue we attribute to a pattern shortcut, i.e., a modal bias that prioritizes superficial sensor patterns over underlying semantics. To address this problem, we introduce UNIV, a Unified foundation model for Infrared and Visible modalities. At the core of UNIV lies Patch Cross-modal Contrastive Learning (PCCL), a self-supervised contrastive learning strategy that constructs a unified cross-modal feature space. PCCL employs a frozen pre-trained model to sample pseudo patch pairs based on semantic similarity, and aligns infrared-visible representations by attracting semantically related pairs while repelling unrelated ones. This process simultaneously enhances cross-modal alignment and inter-class semantic separability, guiding the model to focus on semantic structure rather than falling into pattern shortcuts. To further enable cross-modal learning, we introduce MVIP, the most comprehensive visible-infrared benchmark to date, containing 98,992 precisely aligned image pairs across diverse scenes. Extensive experiments demonstrate UNIV's superior performance on infrared tasks (+1.7 mIoU for semantic segmentation and +0.7 mAP for detection), while maintaining competitive accuracy on RGB tasks.
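
The patch-pairing idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the cosine-similarity threshold, the temperature, and the multi-positive InfoNCE-style loss are all assumptions chosen for illustration. The sketch also assumes that the infrared and visible images are spatially aligned, so patch i in one modality covers the same region as patch i in the other.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale each feature vector to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def pseudo_pairs(frozen_rgb_feats, threshold=0.7):
    """Boolean (N, N) mask of pseudo-positive patch pairs.

    Entry (i, j) is True when the frozen encoder's features for RGB patches
    i and j have cosine similarity >= threshold (threshold is an assumption).
    """
    f = l2_normalize(frozen_rgb_feats)
    return (f @ f.T) >= threshold

def patch_contrastive_loss(ir_feats, rgb_feats, positive_mask, temperature=0.1):
    """Hypothetical multi-positive contrastive loss over IR -> RGB patches.

    Each IR patch is an anchor; RGB patches flagged in positive_mask are
    attracted, all other RGB patches are repelled via the softmax denominator.
    """
    ir = l2_normalize(ir_feats)
    rgb = l2_normalize(rgb_feats)
    logits = (ir @ rgb.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    pos = positive_mask.astype(float)
    # Average log-probability over each anchor's positives, then over anchors.
    per_anchor = -(pos * log_prob).sum(axis=1) / np.maximum(pos.sum(axis=1), 1.0)
    return per_anchor.mean()

# Toy usage: four patches, two semantic groups ({0, 1} and {2, 3}).
frozen = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])
mask = pseudo_pairs(frozen)
loss_aligned = patch_contrastive_loss(frozen, frozen, mask)
loss_misaligned = patch_contrastive_loss(frozen[::-1], frozen, mask)
print(loss_aligned < loss_misaligned)
```

In this toy setup, infrared features that agree with the pseudo-pairs receive a lower loss than features assigned to the wrong semantic group, which is the behaviour the abstract describes as attracting related pairs and repelling unrelated ones.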

Paper Structure

This paper contains 26 sections, 12 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Visualization of attention maps from the final layer of different pre-trained models for a specific patch (marked with a star). Panels (a) and (c) show a cyclist and a vehicle in the infrared modality, while (b) and (d) show the same objects in the visible RGB modality. UNIV establishes strong cross-modal alignment and inter-class separability, enabling consistent attention localization for the focused regions (marked with white boxes) at the patch level. The mIoU values are semantic segmentation scores for the infrared (IR) and visible (RGB) modalities on the MSRS dataset (Tang2022PIAFusion).
  • Figure 2: Proposed Unified Feature Space. UNIV learns a semantically rich and robust feature mapping by promoting intra-class cross-modal alignment and enhancing inter-class separation.
  • Figure 3: Overview of the proposed UNIV. Attention maps from a frozen visible RGB backbone are transformed to similarity pseudo-labels $\boldsymbol{M}^{A}$. The cross-modality similarity map $\boldsymbol{M}^{IA}$ and visible modality similarity map $\boldsymbol{M}^{VA}$ are optimized to align with $\boldsymbol{M}^{A}$.
  • Figure 4: Cross-modal Alignment. Our model achieves excellent cross-modal alignment with efficient inter-class separability.
  • Figure 5: Inter-class Separability. Our model exhibits better inter-class separability.
  • ...and 1 more figure