Supervised Image Translation from Visible to Infrared Domain for Object Detection

Prahlad Anand, Qiranul Saadiyean, Aniruddh Sikdar, Nalini N, Suresh Sundaram

TL;DR

This paper tackles the domain gap between visible and infrared imagery for object detection by learning a supervised translation from visible to infrared images with a two-stage, GAN-based pipeline. It introduces a coarse-to-fine generator, multi-scale discriminators, and a stabilizing feature-matching objective, complemented by an optional super-resolution step intended to improve high-resolution translation quality. Training standard detectors (e.g., Yolov5, Mask R-CNN, Faster R-CNN) on the translated infrared data improves downstream detection performance, with a reported gain of up to 5.3% mAP. The authors report better generalization to real infrared data, which suggests a practical way to leverage abundant visible data for infrared object detection in challenging environments.
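To make the feature-matching objective more concrete, here is a minimal, framework-free sketch. It assumes a pix2pixHD-style formulation, in which the loss is the mean absolute difference between discriminator activations on real and generated images, summed over layers and over the multi-scale discriminators. This summary does not spell out the exact form or weighting the paper uses, so the function names (`l1_mean`, `feature_matching_loss`) and the nested-list data layout are hypothetical.

```python
from typing import List, Sequence

# Assumed layout: feats[k][i] is the flattened activation of layer i
# of discriminator k. This layout is illustrative, not from the paper.
Features = List[List[Sequence[float]]]


def l1_mean(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equally sized flat feature maps."""
    if len(a) != len(b):
        raise ValueError("feature maps must have the same number of elements")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def feature_matching_loss(real_feats: Features, fake_feats: Features) -> float:
    """Sum the per-layer mean L1 distance over all layers of all discriminators."""
    if len(real_feats) != len(fake_feats):
        raise ValueError("real and fake must come from the same discriminators")
    total = 0.0
    for real_layers, fake_layers in zip(real_feats, fake_feats):
        if len(real_layers) != len(fake_layers):
            raise ValueError("each discriminator must expose the same layers")
        for real_map, fake_map in zip(real_layers, fake_layers):
            total += l1_mean(real_map, fake_map)
    return total


# Two hypothetical discriminators: the first exposes two layers, the second one.
real = [[[1.0, 2.0], [0.0]], [[3.0]]]
fake = [[[1.0, 4.0], [1.0]], [[2.5]]]
loss = feature_matching_loss(real, fake)
```

In a real training loop the activations would be tensors taken from intermediate discriminator layers, and this term would be added to the adversarial loss with a weighting factor that is not given here.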

Abstract

This study aims to learn a translation from visible to infrared imagery, bridging the domain gap between the two modalities so as to improve accuracy on downstream tasks including object detection. Previous approaches attempt to perform bi-domain feature fusion through iterative optimization or end-to-end deep convolutional networks. However, we pose the problem as one of image translation, adopting a two-stage training strategy with a Generative Adversarial Network and an object detection model. The translation model learns a conversion that preserves the structural detail of visible images while capturing the texture and other characteristics of infrared images. The generated images are used to train standard object detection frameworks including Yolov5, Mask R-CNN, and Faster R-CNN. We also investigate the usefulness of integrating a super-resolution step into our pipeline to further improve model accuracy, and achieve an improvement of as high as 5.3% mAP.

Paper Structure

This paper contains 8 sections, 5 equations, and 3 figures.

Figures (3)

  • Figure 1: Network architecture of the generator. A residual network $G_{1}$ is first trained on lower-resolution images. A second residual network $G_{2}$ is then appended to $G_{1}$, and the two networks are trained jointly on high-resolution images. Specifically, the input to the residual blocks in $G_{2}$ is the element-wise sum of the final feature map from $G_{1}$ and the feature map from $G_{2}$.
  • Figure 2: Visualization of predictions for a subset of images from the M3FD dataset. The first row shows ground-truth bounding boxes, while the second and third rows show Yolov5s predictions on the test set when the model is trained on the source infrared training images and on generated images, respectively. Training on generated images results in higher confidence scores and fewer misclassifications.
  • Figure 3: Confusion matrices for training and testing (on the source test set only) for the M3FD dataset. The matrices are similar, but model performance is notably lower during training on generated IR images and higher during testing. The hypothesis is that the generated infrared images, being marginally different from the source infrared images, force the model to generalize, whereas the source-trained model may be overfitted.
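
The element-wise sum described in the Figure 1 caption can be illustrated with a small, dependency-free sketch. It assumes that the feature map from $G_{2}$ is at half the input resolution so that it matches the final feature map of $G_{1}$. The average pooling below is only a stand-in for whatever learned downsampling $G_{2}$ actually uses, and the helper names (`avg_pool_2x2`, `elementwise_sum`) and all values are hypothetical.

```python
from typing import List

# A single-channel H x W feature map as nested lists, for illustration only.
FeatureMap = List[List[float]]


def avg_pool_2x2(x: FeatureMap) -> FeatureMap:
    """Halve the resolution by averaging non-overlapping 2x2 windows.

    Stand-in (an assumption) for the downsampling front end of G2.
    """
    h, w = len(x), len(x[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    return [
        [(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 4.0
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]


def elementwise_sum(a: FeatureMap, b: FeatureMap) -> FeatureMap:
    """Add two feature maps of identical shape, entry by entry."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("feature maps must have the same shape")
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Hypothetical 4x4 high-resolution input seen by G2.
high_res_input = [
    [1.0, 3.0, 5.0, 7.0],
    [1.0, 3.0, 5.0, 7.0],
    [2.0, 2.0, 4.0, 4.0],
    [2.0, 2.0, 4.0, 4.0],
]
g2_features = avg_pool_2x2(high_res_input)   # 2x2 map from G2's front end
g1_final_features = [[0.5, -1.0], [0.0, 1.0]]  # hypothetical 2x2 output of G1

# This sum is what would be fed to the residual blocks of G2.
fused = elementwise_sum(g1_final_features, g2_features)
```

The shape check in `elementwise_sum` reflects the one hard constraint implied by the caption: the two feature maps must have identical shapes before they can be summed.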