InfMAE: A Foundation Model in the Infrared Modality

Fangcen Liu, Chenqiang Gao, Yaming Zhang, Junjie Guo, Jinhao Wang, Deyu Meng

TL;DR

This paper proposes InfMAE, a foundation model in the infrared modality, together with an information-aware masking strategy suited to infrared images. InfMAE outperforms other supervised and self-supervised learning methods on three downstream tasks.

Abstract

In recent years, foundation models have swept the computer vision field and facilitated the development of various tasks across different modalities. However, how to design an infrared foundation model remains an open question. In this paper, we propose InfMAE, a foundation model in the infrared modality. We release an infrared dataset, called Inf30, to address the lack of large-scale data for self-supervised learning in the infrared vision community. We also design an information-aware masking strategy suited to infrared images. This masking strategy places greater emphasis on regions with richer information during self-supervised learning, which is conducive to learning generalized representations. In addition, we adopt a multi-scale encoder to enhance the performance of the pre-trained encoders in downstream tasks. Finally, because infrared images lack rich detail and texture information, we design an infrared decoder module, which further improves performance on downstream tasks. Extensive experiments show that InfMAE outperforms other supervised and self-supervised learning methods on three downstream tasks.
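The abstract does not say how the information-aware masking strategy measures information or selects patches. As a rough illustration of the general idea only, the sketch below assumes per-patch intensity variance as the information score and keeps the highest-scoring patches visible to the encoder. Both choices, along with the names `patch_scores` and `information_aware_mask`, the patch size, and the mask ratio, are hypothetical and are not taken from the paper.

```python
import numpy as np


def patch_scores(image: np.ndarray, patch: int) -> np.ndarray:
    """Per-patch intensity variance (an assumed stand-in for 'information')."""
    h, w = image.shape
    gh, gw = h // patch, w // patch
    # Crop to a whole number of patches, then group pixels by patch.
    blocks = image[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch)
    # Result has one row per patch, in row-major patch order.
    return blocks.transpose(0, 2, 1, 3).reshape(gh * gw, -1).var(axis=1)


def information_aware_mask(
    image: np.ndarray, patch: int = 16, mask_ratio: float = 0.75
) -> np.ndarray:
    """Return a boolean vector over patches; True means the patch is masked.

    Assumption: patches with the highest scores stay visible, so the
    encoder sees the regions that carry the most information.
    """
    scores = patch_scores(image, patch)
    n_keep = max(1, int(round(scores.size * (1.0 - mask_ratio))))
    keep = np.argsort(scores)[::-1][:n_keep]  # indices of highest scores
    mask = np.ones(scores.size, dtype=bool)
    mask[keep] = False
    return mask
```

For a single-channel 64×64 image with `patch=16` and `mask_ratio=0.75`, this masks 12 of the 16 patches and leaves the 4 highest-variance patches visible. A different score or the opposite selection rule would slot into the same structure.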

Paper Structure (33 sections, 2 equations, 3 figures, 8 tables)

Figures (3)

  • Figure 1: Visual comparison of infrared and visible images. Compared to visible images, infrared images carry less information because they inherently lack rich texture and color details. For example, objects such as zebra crossings, telephone poles, and road signs in the red box often blend into their surroundings because their temperature is similar to that of the environment.
  • Figure 2: Samples from the Inf30 dataset. The environments in the collected dataset include skies, seascapes, forests, urban areas, suburban areas, and lawns, among others. The objects include ships, vehicles, pedestrians, public facilities, and residential buildings, among others.
  • Figure 3: The framework of the proposed InfMAE. It contains three modules: the mask block generation module, the multi-scale encoder module, and the infrared decoder module. The features connected by the red dashed lines in the multi-scale encoder module are fed into the infrared decoder module.