IV-tuning: Parameter-Efficient Transfer Learning for Infrared-Visible Tasks

Yaming Zhang, Chenqiang Gao, Fangcen Liu, Junjie Guo, Lan Wang, Xinggan Peng, Deyu Meng

TL;DR

This work proposes IV-tuning, a parameter-efficient method for adapting Pre-trained Visual Models (PVMs) to various infrared-visible (IR-VIS) downstream tasks, including salient object detection, semantic segmentation, and object detection. The method exhibits superior generalization and scalability.

Abstract

Existing infrared and visible (IR-VIS) methods inherit the general representations of Pre-trained Visual Models (PVMs) to facilitate complementary learning. However, our analysis indicates that under the full fine-tuning paradigm, the feature space becomes highly constrained and low-rank, which has been proven to seriously impair generalization. One remedy is to freeze the parameters, which preserves pretrained knowledge and helps maintain feature diversity. To this end, we propose IV-tuning, which parameter-efficiently harnesses PVMs for various IR-VIS downstream tasks, including salient object detection, semantic segmentation, and object detection. Extensive experiments across various settings demonstrate that IV-tuning outperforms previous state-of-the-art methods and exhibits superior generalization and scalability. Remarkably, with only a single backbone, IV-tuning effectively facilitates the complementary learning of infrared and visible modalities with only 3% of the backbone parameters trainable, and achieves superior computational efficiency compared to conventional IR-VIS paradigms.
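
The "low-rank feature space" claim in the abstract can be made concrete with a small numerical sketch. The snippet below is an illustration only, not the authors' analysis code: the function name `explained_variance_ratio` and the synthetic `diverse` and `collapsed` feature matrices are assumptions made for this example. It measures how much of the total variance the top principal components capture; a value close to 1 for a handful of components indicates a collapsed, low-rank feature space.

```python
import numpy as np


def explained_variance_ratio(features, k=5):
    """Fraction of total variance captured by each of the top-k principal components.

    features: array of shape (n_samples, dim), e.g. tokens from one layer.
    """
    x = np.asarray(features, dtype=np.float64)
    x = x - x.mean(axis=0, keepdims=True)     # center each feature dimension
    s = np.linalg.svd(x, compute_uv=False)    # singular values, in descending order
    var = s ** 2                              # proportional to per-component variance
    return var[:k] / var.sum()


rng = np.random.default_rng(0)
n, d = 512, 64

# Hypothetical "diverse" features: isotropic noise spread over all dimensions.
diverse = rng.normal(size=(n, d))

# Hypothetical "collapsed" features: a rank-2 signal plus a little noise.
collapsed = rng.normal(size=(n, 2)) @ rng.normal(size=(2, d))
collapsed = collapsed + 0.01 * rng.normal(size=(n, d))

ratio_diverse = explained_variance_ratio(diverse)
ratio_collapsed = explained_variance_ratio(collapsed)

print("top-5 variance share, diverse:  ", ratio_diverse.sum())
print("top-5 variance share, collapsed:", ratio_collapsed.sum())
```

In this toy setting the collapsed features concentrate almost all of their variance in the first few components, whereas the diverse features spread it across many directions.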

Paper Structure

This paper contains 19 sections, 6 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: (a) top: Existing infrared-visible (IR-VIS) methods typically extend Pre-trained Visual Models (PVMs) into a dual-branch network and perform full fine-tuning. (a) bottom: We propose a streamlined paradigm where the utilization of infrared can be greatly simplified. (b) Fully fine-tuning the PVM leads to overfitting on background regions (as evidenced in Fig. 2), whereas our method alleviates overfitting and learns the complementarity between modalities effectively. (c) Our model demonstrates superior performance and generality with fewer trainable parameters.
  • Figure 2: Observation of effective information contained in the feature space via Principal Component Analysis (PCA). We apply PCA for dimension reduction and visualize the explained variance ratio of the principal components with high contributions. We track the first five principal components across layers of the EVA02-L [fang2024eva] + Segformer [xie2021segformer] model and illustrate how the feature space evolves as depth increases. We show that: (a) the fully fine-tuned model quickly converges to a highly constrained and low-rank subspace in higher layers, thereby sacrificing generalization ability; (b) the frozen model maintains feature diversity but struggles to extract task-specific discriminative information. In contrast, our model (c) not only preserves the feature diversity of subspaces but also effectively extracts task-specific discriminative information.
  • Figure 3: Analysis of the energy distribution of the infrared and visible modalities, and of frequency spectral patterns under varying conditions, where the center represents low frequencies and the corners represent high frequencies. We calculate the statistical average energy distribution on the MFNet [ha2017mfnet] training set. We show that (1) compared with visible images, the infrared modality exhibits a stronger low-frequency response, while its mid-to-high-frequency components share certain similarities with those of the visible modality; (2) a convolutional layer enhances high-frequency details, boosting texture and edge information in visible images, but it causes the loss of low-frequency components in the infrared modality, which are critical for complementary learning. In contrast, a simple linear projection can effectively capture low-frequency signals in the infrared modality, which motivates our design.
  • Figure 4: Overview of the proposed IV-tuning. IV-tuning freezes the $L$-layer transformer-based backbone and fine-tunes only a few selected modules to learn effective visual prompts. The initial prompt $\bm{\mathcal{P}}^{0}$ is generated by the MP-$\alpha$ block and sent to each MP-$\beta$ block, with each encoder's output updating the initial prompt $\bm{\mathcal{P}}^{0}$.
  • Figure 5: Detailed design of the Split-Fuse Enhancer. The input tokens are reshaped to feature maps, then a depth-wise $3\times3$ convolution is applied to selected channels with a residual connection, followed by two $1\times1$ convolutional layers with Batch Normalization (BN) and ReLU activation in between, where the output maintains the same dimension as the input. We set the split ratio $\frac{1}{r}$ to $\frac{1}{4}$ by default.
  • ...and 5 more figures
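
The spectral view described in the Figure 3 caption (low frequencies at the center, high frequencies at the corners) corresponds to a centered 2D Fourier spectrum. The following is a minimal sketch of how such a low-frequency energy share could be computed. It is not the authors' procedure: the function name `low_frequency_energy_ratio`, the `radius_frac` parameter, and the synthetic `smooth` and `noisy` images are assumptions for illustration, standing in for a low-frequency-dominated image and a texture-heavy image respectively.

```python
import numpy as np


def low_frequency_energy_ratio(image, radius_frac=0.1):
    """Share of spectral energy inside a disc around the zero-frequency bin.

    image: 2D array (single channel).
    radius_frac: disc radius as a fraction of the shorter image side (assumed value).
    """
    img = np.asarray(image, dtype=np.float64)
    # fftshift moves the zero-frequency (DC) bin to index (h // 2, w // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(img))
    energy = np.abs(spectrum) ** 2
    h, w = img.shape
    yy, xx = np.mgrid[:h, :w]
    dist = np.hypot(yy - h // 2, xx - w // 2)   # distance from the spectrum center
    mask = dist <= radius_frac * min(h, w)
    return energy[mask].sum() / energy.sum()


rng = np.random.default_rng(0)

# Hypothetical smooth image: a single broad blob, dominated by low frequencies.
smooth = np.outer(np.hanning(64), np.hanning(64))

# Hypothetical texture-heavy image: white noise, energy spread over all frequencies.
noisy = rng.normal(size=(64, 64))

ratio_smooth = low_frequency_energy_ratio(smooth)
ratio_noisy = low_frequency_energy_ratio(noisy)

print("low-frequency share, smooth:", ratio_smooth)
print("low-frequency share, noisy: ", ratio_noisy)
```

Averaging such a ratio (or the full energy map) over a dataset, separately for each modality, would give a statistic in the spirit of the caption; the exact normalization used in the paper is not stated in this excerpt.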