UNIP: Rethinking Pre-trained Attention Patterns for Infrared Semantic Segmentation

Tao Zhang; Jinyong Wen; Zhen Chen; Kun Ding; Shiming Xiang; Chunhong Pan

TL;DR

The paper addresses the challenge of transferring pre-trained knowledge to infrared semantic segmentation across the large domain gap from RGB data. It benchmarks six pre-training methods, analyzes pre-trained attention patterns and the role of texture bias, and introduces UNIP, a unified infrared pre-training framework built from NMI-HAD, InfMix, and LL-FPN. UNIP improves average mIoU by up to 13.5% over other pre-training methods, and UNIP-S performs on par with MAE-L at roughly 1/10 of the computational cost. The study offers practical insights into cross-domain pre-training and indicates that the approach may extend to other modalities, such as RGB and depth.

Abstract

Pre-training techniques significantly enhance the performance of semantic segmentation tasks with limited training data. However, the efficacy under a large domain gap between pre-training (e.g., RGB) and fine-tuning (e.g., infrared) remains underexplored. In this study, we first benchmark the infrared semantic segmentation performance of various pre-training methods and reveal several phenomena distinct from the RGB domain. Next, our layerwise analysis of pre-trained attention maps uncovers that: (1) There are three typical attention patterns (local, hybrid, and global); (2) Pre-training tasks notably influence the pattern distribution across layers; (3) The hybrid pattern is crucial for semantic segmentation as it attends to both nearby and foreground elements; (4) The texture bias impedes model generalization in infrared tasks. Building on these insights, we propose UNIP, a UNified Infrared Pre-training framework, to enhance the pre-trained model performance. This framework uses the hybrid-attention distillation NMI-HAD as the pre-training target, a large-scale mixed dataset InfMix for pre-training, and a last-layer feature pyramid network LL-FPN for fine-tuning. Experimental results show that UNIP outperforms various pre-training methods by up to 13.5% in average mIoU on three infrared segmentation tasks, evaluated using fine-tuning and linear probing metrics. UNIP-S achieves performance on par with MAE-L while requiring only 1/10 of the computational cost. Furthermore, UNIP significantly surpasses state-of-the-art (SOTA) infrared or RGB segmentation methods and demonstrates broad potential for application in other modalities, such as RGB and depth. Our code is available at https://github.com/casiatao/UNIP.
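
The abstract does not spell out how the local, hybrid, and global attention patterns are told apart, nor what "NMI" in NMI-HAD measures. As a rough illustration only, the sketch below assumes that NMI stands for normalized mutual information between query and key positions of a single attention map, with a uniform prior over queries. The function name `attention_nmi`, the uniform-prior assumption, and the geometric-mean normalization are hypothetical choices made for this example and may differ from the authors' implementation.

```python
import numpy as np


def attention_nmi(attn: np.ndarray) -> float:
    """Normalized mutual information between query and key positions.

    attn: (N, N) row-stochastic matrix, attn[i, j] = p(key j | query i).
    Assumption (not from the abstract): queries are uniformly distributed,
    and MI is normalized by sqrt(H(Q) * H(K)).
    """
    n = attn.shape[0]
    joint = attn / n                      # p(q, k) under a uniform query prior
    p_q = joint.sum(axis=1)               # marginal over queries
    p_k = joint.sum(axis=0)               # marginal over keys

    # Mutual information, skipping zero-probability cells (0 * log 0 = 0).
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(p_q, p_k)[nz]))

    # Marginal entropies, again skipping zero-probability entries.
    h_q = -np.sum(p_q[p_q > 0] * np.log(p_q[p_q > 0]))
    h_k = -np.sum(p_k[p_k > 0] * np.log(p_k[p_k > 0]))

    denom = np.sqrt(h_q * h_k)
    if denom < 1e-12:                     # degenerate case: all mass on one key
        return 0.0
    return float(mi / denom)


# Two extreme cases on 8 tokens.
n_tokens = 8
local_like = np.eye(n_tokens)                              # each query attends to itself
global_like = np.full((n_tokens, n_tokens), 1 / n_tokens)  # every query attends uniformly
```

Under these assumptions, an attention map where every query attends only to its own position scores 1, and a map where every query attends uniformly to all keys scores 0. Maps that mix nearby and shared foreground tokens would fall between the two extremes; the thresholds used to label a layer as hybrid are not given in the abstract.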
