DuGI-MAE: Improving Infrared Mask Autoencoders via Dual-Domain Guidance

Yinghui Xing, Xiaoting Su, Shizhou Zhang, Donghao Chu, Di Xu

TL;DR

DuGI-MAE tackles infrared-specific representation learning by combining an entropy-based masking strategy with a Dual-Domain Guidance module that fuses spatial and adaptive frequency-domain cues. The approach is pretrained on Inf-590K, a large-scale infrared dataset, and demonstrates superior generalization across infrared object detection, semantic segmentation, and small-target detection compared to state-of-the-art self-supervised methods. The DDG module and entropy-based masking address information sparsity and non-uniform noise, yielding robust, transferable representations for infrared vision tasks. The work also provides a practical infrared pretraining dataset and shows the DDG framework can enhance other SSL baselines in infrared data scenarios.

Abstract

Infrared imaging plays a critical role in low-light and adverse weather conditions. However, due to the distinct characteristics of infrared images, existing foundation models such as Masked Autoencoder (MAE) trained on visible data perform suboptimally in infrared image interpretation tasks. To bridge this gap, an infrared foundation model known as InfMAE was developed and pre-trained on large-scale infrared datasets. Despite its effectiveness, InfMAE still faces several limitations, including the omission of informative tokens, insufficient modeling of global associations, and neglect of non-uniform noise. In this paper, we propose a Dual-domain Guided Infrared foundation model based on MAE (DuGI-MAE). First, we design a deterministic masking strategy based on token entropy, preserving only high-entropy tokens for reconstruction to enhance informativeness. Next, we introduce a Dual-Domain Guidance (DDG) module, which simultaneously captures global token relationships and adaptively filters non-uniform background noise commonly present in infrared imagery. To facilitate large-scale pretraining, we construct Inf-590K, a comprehensive infrared image dataset encompassing diverse scenes, various target types, and multiple spatial resolutions. Pretrained on Inf-590K, DuGI-MAE demonstrates strong generalization capabilities across various downstream tasks, including infrared object detection, semantic segmentation, and small target detection. Experimental results validate the superiority of the proposed method over both supervised and self-supervised comparison methods. Our code is available in the supplementary material.
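
The deterministic, entropy-based token selection described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names (`patch_entropy`, `entropy_keep_indices`), the histogram-based entropy estimate, the patch size, the bin count, and the keep ratio are all assumptions made for illustration, as is the reading that the highest-entropy patches are the ones retained.

```python
import numpy as np

def patch_entropy(image, patch=16, bins=32):
    """Shannon entropy (in bits) of the intensity histogram of each patch.

    Assumes a 2-D grayscale image with values in [0, 1]. Patches are
    non-overlapping and indexed row-major (row * grid_width + col).
    """
    h, w = image.shape
    gh, gw = h // patch, w // patch
    # Crop to a whole number of patches, then flatten each patch into a row.
    tokens = (image[:gh * patch, :gw * patch]
              .reshape(gh, patch, gw, patch)
              .transpose(0, 2, 1, 3)
              .reshape(gh * gw, patch * patch))
    entropies = np.empty(gh * gw)
    for i, tok in enumerate(tokens):
        hist, _ = np.histogram(tok, bins=bins, range=(0.0, 1.0))
        p = hist[hist > 0] / tok.size          # empirical bin probabilities
        entropies[i] = -np.sum(p * np.log2(p))
    return entropies

def entropy_keep_indices(image, keep_ratio=0.25, patch=16, bins=32):
    """Return the sorted indices of the highest-entropy patches and all entropies.

    The selection is deterministic: a stable sort breaks ties by patch index,
    so the same image always yields the same kept set.
    """
    ent = patch_entropy(image, patch, bins)
    n_keep = max(1, int(round(keep_ratio * ent.size)))
    order = np.argsort(-ent, kind="stable")    # descending entropy
    return np.sort(order[:n_keep]), ent

if __name__ == "__main__":
    # Toy 32x32 image: flat background with one textured patch (top-right).
    img = np.full((32, 32), 0.5)
    img[:16, 16:] = np.linspace(0.0, 1.0, 256, endpoint=False).reshape(16, 16)
    keep, ent = entropy_keep_indices(img, keep_ratio=0.25)
    print("kept patch indices:", keep.tolist())
    print("patch entropies:", np.round(ent, 3).tolist())
```

Unlike random masking, this kind of selection has no sampling step, so flat background patches (entropy near zero) are never chosen ahead of textured ones.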

Paper Structure

This paper contains 16 sections, 4 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: (a) Representative infrared image from a typical scene. Left: The original infrared image, where strong background responses often suppress the actual targets; Middle: Entropy map of the image; Right: Image processed with Adaptive Frequency-Domain Modulation (AFDM). (b) Comparison between the information-aware masking of InfMAE [liu2024infmae] and our entropy-based masking.
  • Figure 2: Resolution distribution of the Inf-590K dataset. The horizontal and vertical axes represent image width and height, respectively, while the size of each bubble indicates the number of samples corresponding to that resolution.
  • Figure 3: Overall architecture of DuGI-MAE. It consists of the (a) Entropy-Based Masking Module, the (b) Encoder, the (c) Dual-Domain Guidance (DDG) module, and the (d) Decoder.
  • Figure 4: Adaptive Frequency-Domain Modulation (AFDM). The input images are first transformed into the frequency domain via the Fast Fourier Transform (FFT). A learnable radial filter is then applied to suppress non-uniform background noise (usually low-frequency components) while preserving discriminative features. Finally, the processed features are transformed back to the spatial domain using the Inverse FFT (IFFT).
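
The FFT, radial filter, IFFT pipeline in the Figure 4 caption can be sketched in NumPy as follows. This is a hedged illustration rather than the paper's AFDM module: the helper names (`radial_bins`, `radial_modulate`) and the per-bin gain parameterization are assumptions, and the gains are a fixed array here, whereas a learnable filter would update them by gradient descent during pretraining.

```python
import numpy as np

def radial_bins(h, w, n_bins):
    """Assign every 2-D FFT coefficient to a radial-frequency bin in [0, n_bins)."""
    fy = np.fft.fftfreq(h)[:, None]    # cycles per pixel along rows
    fx = np.fft.fftfreq(w)[None, :]    # cycles per pixel along columns
    radius = np.sqrt(fx ** 2 + fy ** 2)
    idx = np.floor(radius / radius.max() * n_bins).astype(int)
    return np.minimum(idx, n_bins - 1)  # the largest radius falls in the last bin

def radial_modulate(image, gains):
    """FFT -> multiply by a radially symmetric gain -> inverse FFT.

    `gains[k]` scales all frequencies in radial bin k (bin 0 holds the DC
    term and the lowest frequencies). Because the gain depends only on the
    frequency magnitude, a real input yields a real output up to round-off.
    """
    gains = np.asarray(gains, dtype=float)
    bins = radial_bins(image.shape[0], image.shape[1], len(gains))
    spectrum = np.fft.fft2(image)
    filtered = spectrum * gains[bins]
    return np.fft.ifft2(filtered).real

if __name__ == "__main__":
    # Toy input: constant background (low frequency) plus a fine checkerboard.
    yy, xx = np.indices((8, 8))
    checker = np.where((xx + yy) % 2 == 0, 1.0, -1.0)
    image = 3.0 + checker
    # Zero the lowest radial bin, pass the rest unchanged.
    out = radial_modulate(image, gains=[0.0, 1.0, 1.0, 1.0])
    print("background removed:", np.allclose(out, checker))
```

Setting the lowest-bin gain to zero removes the constant background while leaving the high-frequency pattern intact, which mirrors the caption's point that background noise tends to sit in low-frequency components.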