Implicit Multi-Spectral Transformer: An Lightweight and Effective Visible to Infrared Image Translation Model

Yijia Chen, Pinghua Chen, Xiangxin Zhou, Yingtie Lei, Ziyang Zhou, Mingxian Li

TL;DR

This work tackles the challenge of translating low-contrast visible images into high-contrast infrared representations without the heavy cost of infrared sensors. It proposes a lightweight Transformer-based VIS→IR framework that fuses visible texture and color cues into the infrared domain via a Color Perception Adapter (CPA), Enhanced Feature Mapping Module (EFM), Dynamic Fusion Aggregation (DFA), and Enhanced Perception Attention (EPA), followed by global-context refinement with a Transformer. A dual loss combining $L_{smooth}$ and $L_{SSIM}$ guides training, and extensive experiments on five diverse datasets demonstrate superior quantitative and qualitative performance with minimal parameter overhead. The results also show improved downstream applicability for tasks like pedestrian detection, underscoring the practical impact for safety-critical applications in autonomous driving and robotics.
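The dual loss mentioned above can be illustrated with a small sketch. It rests on several assumptions that the summary does not confirm: that $L_{smooth}$ is the standard smooth L1 loss, that $L_{SSIM}$ is $1 - \mathrm{SSIM}$, that SSIM is computed over a single window covering the whole image (practical implementations use sliding Gaussian windows), and that the two terms are combined with the hypothetical weights `w_smooth` and `w_ssim`. Images are represented as flat lists of intensities in [0, 1].

```python
from statistics import fmean


def smooth_l1(pred, target, beta=1.0):
    """Mean smooth L1: quadratic where |diff| < beta, linear elsewhere."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


def global_ssim(x, y, data_range=1.0):
    """SSIM over one window spanning the whole image (a simplification)."""
    c1 = (0.01 * data_range) ** 2  # stabilizer for the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizer for the contrast/structure term
    mx, my = fmean(x), fmean(y)
    vx = fmean([(a - mx) * (a - mx) for a in x])
    vy = fmean([(b - my) * (b - my) for b in y])
    cov = fmean([(a - mx) * (b - my) for a, b in zip(x, y)])
    numerator = (2 * mx * my + c1) * (2 * cov + c2)
    denominator = (mx * mx + my * my + c1) * (vx + vy + c2)
    return numerator / denominator


def dual_loss(pred, target, w_smooth=1.0, w_ssim=1.0):
    """Weighted sum of smooth L1 and (1 - SSIM); the weights are hypothetical."""
    l_smooth = smooth_l1(pred, target)
    l_ssim = 1.0 - global_ssim(pred, target)
    return w_smooth * l_smooth + w_ssim * l_ssim


if __name__ == "__main__":
    generated_ir = [0.10, 0.40, 0.80, 0.30]  # toy "generated infrared" pixels
    reference_ir = [0.20, 0.50, 0.70, 0.30]  # toy "reference infrared" pixels
    print(dual_loss(generated_ir, reference_ir))
```

The pixel-wise term penalizes intensity errors while staying robust to outliers, and the SSIM term rewards agreement in local statistics, which is one plausible reason to pair the two; the actual weighting and windowing used by the authors may differ.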

Abstract

In the field of computer vision, visible light images often exhibit low contrast in low-light conditions, presenting a significant challenge. While infrared imagery provides a potential solution, its utilization entails high costs and practical limitations. Recent advancements in deep learning, particularly the deployment of Generative Adversarial Networks (GANs), have facilitated the transformation of visible light images to infrared images. However, these methods often experience unstable training phases and may produce suboptimal outputs. To address these issues, we propose a novel end-to-end Transformer-based model that efficiently converts visible light images into high-fidelity infrared images. Initially, the Texture Mapping Module and Color Perception Adapter collaborate to extract texture and color features from the visible light image. The Dynamic Fusion Aggregation Module subsequently integrates these features. Finally, the transformation into an infrared image is refined through the synergistic action of the Color Perception Adapter and the Enhanced Perception Attention mechanism. Comprehensive benchmarking experiments confirm that our model outperforms existing methods, producing infrared images of markedly superior quality, both qualitatively and quantitatively. Furthermore, the proposed model enables more effective downstream applications for infrared images than other methods.

Paper Structure (24 sections, 7 equations, 4 figures, 3 tables)

This paper contains 24 sections, 7 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Comparison of image translation techniques for visible-to-infrared translation: (a) the original visible-light image; (b)-(e) the translations produced by CycleGAN, MUNIT, BCI, and ClawGAN, respectively; (f) the reference thermal image used as the translation target.
  • Figure 2: The overall architecture of the model. The input image is processed by three parallel modules: Convolution (Conv), Color Perception Adapter (CPA), and Enhanced Feature Mapping Module (EFM), which extract general convolutional, color, and detail features, respectively. The Dynamic Fusion Aggregation Module then merges these features into a latent representation. Next, the CPA and the Enhanced Perception Attention mechanism transform the fused features so that they closely resemble those of an infrared image. Finally, a Transformer module integrates global contextual information to refine the output image.
  • Figure 3: Visual comparison across five datasets. Rows, from top to bottom: LLVIP, RoadScene, M3FD, FLIR, and MCubeS.
  • Figure 4: Comparison of detection performance in a downstream application. The infrared images generated by CycleGAN, MUNIT, ClawGAN, and BCI frequently contain spurious human silhouettes. These artifacts can lead to missed objects and more false detections.