UNIP: Rethinking Pre-trained Attention Patterns for Infrared Semantic Segmentation

Tao Zhang, Jinyong Wen, Zhen Chen, Kun Ding, Shiming Xiang, Chunhong Pan

TL;DR

The paper addresses the challenge of transferring pre-trained knowledge to infrared semantic segmentation under a large domain gap from RGB data. It conducts a comprehensive benchmark of six pre-training methods, analyzes pre-trained attention patterns and the role of texture bias, and introduces UNIP, a unified infrared pre-training framework with NMI-HAD, InfMix, and LL-FPN. The results show up to 13.5% average mIoU gains and substantial efficiency advantages, with UNIP-S approaching MAE-L performance at a fraction of the computational cost. The study provides practical insights into cross-domain pre-training and demonstrates a path toward extending the approach to RGB, depth, and other modalities.

Abstract

Pre-training techniques significantly enhance the performance of semantic segmentation tasks with limited training data. However, the efficacy under a large domain gap between pre-training (e.g., RGB) and fine-tuning (e.g., infrared) remains underexplored. In this study, we first benchmark the infrared semantic segmentation performance of various pre-training methods and reveal several phenomena distinct from the RGB domain. Next, our layerwise analysis of pre-trained attention maps uncovers that: (1) There are three typical attention patterns (local, hybrid, and global); (2) Pre-training tasks notably influence the pattern distribution across layers; (3) The hybrid pattern is crucial for semantic segmentation as it attends to both nearby and foreground elements; (4) The texture bias impedes model generalization in infrared tasks. Building on these insights, we propose UNIP, a UNified Infrared Pre-training framework, to enhance the pre-trained model performance. This framework uses the hybrid-attention distillation NMI-HAD as the pre-training target, a large-scale mixed dataset InfMix for pre-training, and a last-layer feature pyramid network LL-FPN for fine-tuning. Experimental results show that UNIP outperforms various pre-training methods by up to 13.5% in average mIoU on three infrared segmentation tasks, evaluated using fine-tuning and linear probing metrics. UNIP-S achieves performance on par with MAE-L while requiring only 1/10 of the computational cost. Furthermore, UNIP significantly surpasses state-of-the-art (SOTA) infrared or RGB segmentation methods and demonstrates broad potential for application in other modalities, such as RGB and depth. Our code is available at https://github.com/casiatao/UNIP.

Paper Structure

This paper contains 28 sections, 16 equations, 13 figures, 21 tables.

Figures (13)

  • Figure 1: The Chain-of-Thought (CoT) of our work. Step 1 (benchmark section): We benchmark the infrared segmentation performance of various pre-trained models and derive several insights. Step 2 (investigation section): We explore the reasons for the varying behaviors of these models by analyzing the pre-trained attention maps. Step 3 (distillation section): Based on these findings, we propose UNIP, a unified framework aimed at enhancing the performance of small pre-trained models, focusing on three aspects: the pre-training dataset (InfMix), the pre-training task (NMI-HAD), and the fine-tuning architecture (LL-FPN).
  • Figure 2: The performance of pre-trained models across various methods and sizes. Left: The average fine-tuning (FT) performance on three infrared semantic segmentation datasets, along with the associated computational cost. Middle: The average linear probing (LP) performance on three infrared datasets. Right: The fine-tuning performance on ImageNet. The gray dotted lines and corresponding values highlight the performance gains of UNIP over other methods. Detailed results for each dataset are presented in the benchmark table.
  • Figure 3: Attention maps for different query tokens in three representative layers. Each query token's attention map corresponds to a row in the attention matrix, averaged over different heads.
  • Figure 4: NMI on ImageNet.
  • Figure 5: The layerwise linear probing performance of different methods on SODA.
  • ...and 8 more figures
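
The Figure 3 caption describes each query token's attention map as one row of the attention matrix, averaged over heads. A minimal sketch of that operation is given below. It is not taken from the UNIP code base: the function name `query_attention_map`, the nested-list layout `[heads][queries][keys]`, and the assumption that the token sequence contains only patch tokens (no class token) arranged row-major on a `grid_h` x `grid_w` grid are all illustrative choices.

```python
def query_attention_map(attn, query_idx, grid_h, grid_w):
    """Return the head-averaged attention map of one query token.

    attn      -- hypothetical layout: attn[h][q][k] is the attention weight
                 from query token q to key token k in head h
                 (each row is assumed to be softmax-normalized over keys).
    query_idx -- index of the query token whose map is wanted.
    grid_h, grid_w -- patch grid size; grid_h * grid_w must equal the
                 number of key tokens (assumes no class token).
    """
    num_heads = len(attn)
    num_tokens = grid_h * grid_w

    # Row `query_idx` of each head's attention matrix.
    rows = [head[query_idx] for head in attn]
    if any(len(row) != num_tokens for row in rows):
        raise ValueError("grid size does not match the number of key tokens")

    # Average the selected row over heads, key by key.
    mean_row = [sum(row[k] for row in rows) / num_heads
                for k in range(num_tokens)]

    # Reshape the flat row into a 2-D map over the patch grid (row-major).
    return [mean_row[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


# Toy example: 2 heads, 4 patch tokens on a 2 x 2 grid.
uniform = [0.25, 0.25, 0.25, 0.25]
head0 = [[0.5, 0.25, 0.125, 0.125], uniform, uniform, uniform]
head1 = [[0.25, 0.5, 0.125, 0.125], uniform, uniform, uniform]
toy_map = query_attention_map([head0, head1], query_idx=0, grid_h=2, grid_w=2)
print(toy_map)
```

In the toy example the two heads disagree about which of the first two keys matters most, so the averaged map spreads the weight evenly across them. With a real model, `attn` would come from one layer's softmax attention output, and a class token, if present, would have to be dropped before reshaping.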