UniRGB-IR: A Unified Framework for Visible-Infrared Semantic Tasks via Adapter Tuning

Maoxun Yuan, Bo Cui, Tianyi Zhao, Jiayi Wang, Shan Fu, Xue Yang, Xingxing Wei

TL;DR

This work tackles the lack of scalable infrared pretraining for RGB-IR semantic tasks by proposing UniRGB-IR, a unified adapter-based framework that injects rich multi-modal features into a frozen Vision Transformer backbone. The method comprises a Multi-modal Feature Pool (MFP) that extracts contextual, multi-scale RGB/IR features and a Supplementary Feature Injector (SFI) that uses sparse attention and progressive fusion to inject these features into the ViT via adapters. By freezing the ViT and training only the MFP and SFI modules (adapter tuning), the approach preserves prior knowledge while achieving strong performance across RGB-IR object detection, semantic segmentation, and salient object detection, attaining state-of-the-art results on several benchmarks. This framework offers a scalable, parameter-efficient path for multi-modal fusion, with potential to generalize to broader RGB-IR and other multi-modal tasks.

Abstract

Semantic analysis on visible (RGB) and infrared (IR) images has gained significant attention due to its enhanced accuracy and robustness under challenging conditions such as low illumination and adverse weather. However, because no foundation models pre-trained on large-scale infrared image datasets are available, existing methods tend to design task-specific frameworks and directly fine-tune them with pre-trained foundation models on the relevant RGB-IR semantic datasets, which results in poor scalability and limited generalization. To address these limitations, we propose UniRGB-IR, a scalable and efficient framework for RGB-IR semantic tasks that introduces a novel adapter mechanism to effectively incorporate rich multi-modal features into pre-trained RGB-based foundation models. Our framework comprises three key components: a vision transformer (ViT) foundation model, a Multi-modal Feature Pool (MFP) module, and a Supplementary Feature Injector (SFI) module. The MFP and SFI modules cooperate as an adapter to complement the ViT features with contextual multi-scale features. During training, we freeze the entire foundation model to inherit prior knowledge and optimize only the MFP and SFI modules. Furthermore, to verify the effectiveness of our framework, we use ViT-Base as the pre-trained foundation model and perform extensive experiments. Experimental results on various RGB-IR semantic tasks demonstrate that our method can achieve state-of-the-art performance.
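
The training recipe described in the abstract (freeze the foundation model, optimize only the adapter modules) can be illustrated with a deliberately tiny sketch. This is not the authors' implementation: the scalar "backbone" and "adapter" weights, the additive adapter branch, the learning rate, and the toy data are all assumptions made purely for illustration.

```python
# Toy illustration of adapter tuning (hypothetical, not from the paper):
# a frozen scalar "backbone" weight plus a trainable scalar "adapter" weight.

def forward(x, w_backbone, w_adapter):
    # Frozen backbone output plus an additive, trainable adapter branch.
    return w_backbone * x + w_adapter * x


def train_adapter(data, w_backbone, w_adapter=0.0, lr=0.05, steps=200):
    """Gradient descent on the mean squared error, updating only w_adapter."""
    for _ in range(steps):
        grad = 0.0
        for x, y in data:
            err = forward(x, w_backbone, w_adapter) - y
            grad += 2.0 * err * x  # derivative of err**2 w.r.t. w_adapter
        w_adapter -= lr * grad / len(data)
        # w_backbone is intentionally never updated: it stays frozen.
    return w_adapter


# Target function is y = 3x; the frozen backbone only provides y = 2x,
# so the adapter has to learn the missing residual slope.
data = [(x, 3.0 * x) for x in (-1.0, 0.5, 1.0, 2.0)]
w_backbone = 2.0
w_adapter = train_adapter(data, w_backbone)
```

In this toy setting the adapter weight converges toward the residual slope while the backbone weight keeps its pre-trained value, which is the property adapter tuning relies on to preserve prior knowledge.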

Paper Structure (17 sections, 7 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Existing full fine-tuning methods vs. our UniRGB-IR framework. (a) Existing methods use pre-trained RGB-based foundation models and fully fine-tune them on their RGB-IR semantic relevance datasets. (b) We utilize the Adapter (Houlsby et al., 2019) to propose a unified framework, which can efficiently introduce richer RGB-IR features into the pre-trained foundation model for various semantic tasks.
  • Figure 2: The overall architecture of our UniRGB-IR. In our framework, a ViT model with different numbers of ViT blocks is deployed as a foundation model, which is divided into $N$ (usually $N=4$) stages for feature interaction. During training, we freeze the entire ViT model weights and only optimize the MFP and SFI modules.
  • Figure 3: Structure of the multi-modal feature pool (MFP) module. We explore multiple perceptions to expand the receptive field of contextual feature extraction and utilize the feature pyramid to obtain the multi-scale features.
  • Figure 4: Structure of the supplementary feature injector (SFI) module. A gating network is utilized to dynamically fuse the current and the last injected features.
  • Figure 5: Visualization of intermediate results. The $\boldsymbol{F_{mfp}}$ and $\boldsymbol{F_{sfi}}$ features from the first stage are visualized in the third and fourth columns. The t-SNE visualizations are also shown in the last two columns.
  • ...and 1 more figures
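
The captions for Figures 2 and 4 mention two mechanisms that a short sketch may make more concrete: dividing the ViT blocks into $N$ stages, and a gate that blends the current injected feature with the previous one. The code below is only an illustrative guess under stated assumptions. The even split of blocks, the per-channel sigmoid gate, and all function and parameter names (`split_into_stages`, `gated_fusion`, `progressive_injection`, `w_cur`, `w_prev`, `bias`) are hypothetical; the captions do not give the exact formulation used in the paper.

```python
import math


def split_into_stages(num_blocks, num_stages=4):
    """Evenly assign block indices to stages (the even split is an assumption)."""
    if num_blocks % num_stages != 0:
        raise ValueError("num_blocks must be divisible by num_stages")
    per_stage = num_blocks // num_stages
    return [list(range(s * per_stage, (s + 1) * per_stage))
            for s in range(num_stages)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(current, previous, w_cur, w_prev, bias):
    """Blend two equal-length feature vectors with a per-channel gate in (0, 1)."""
    fused = []
    for c, p, wc, wp, b in zip(current, previous, w_cur, w_prev, bias):
        gate = sigmoid(wc * c + wp * p + b)         # hypothetical gating network
        fused.append(gate * c + (1.0 - gate) * p)   # convex combination
    return fused


def progressive_injection(stage_features, w_cur, w_prev, bias):
    """Fuse each stage's feature with the previously injected one, stage by stage."""
    injected = stage_features[0]
    history = [injected]
    for feature in stage_features[1:]:
        injected = gated_fusion(feature, injected, w_cur, w_prev, bias)
        history.append(injected)
    return history


# A 12-block backbone (as in ViT-Base) split into the usual N = 4 stages.
stages = split_into_stages(12)

# With zero gate weights and bias the gate equals 0.5, so fusion is a plain average.
zeros = [0.0, 0.0]
blended = gated_fusion([1.0, 3.0], [3.0, 1.0], zeros, zeros, zeros)
```

A learned gate would replace the fixed zero weights used here, letting each stage decide how much of the newly extracted feature to inject versus how much of the earlier injection to carry forward.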