Diff-Mosaic: Augmenting Realistic Representations in Infrared Small Target Detection via Diffusion Prior

Yukai Shi, Yupei Lin, Pengxu Wei, Xiaoyu Xian, Tianshui Chen, Liang Lin

TL;DR

This work addresses the limited realism and diversity of augmented data for infrared small target detection by introducing Diff-Mosaic, a two-stage data augmentation framework built on a diffusion prior. The Pixel-Prior stage harmonizes Mosaic composites into coherent, texture-consistent images, and the Diff-Prior stage uses a fine-tuned Latent Diffusion Model to resample these images with real-world texture and lighting, yielding more realistic and varied samples. On the SIRST and NUDT-SIRST benchmarks, the approach improves IoU and Pd (probability of detection) and reduces Fa (false-alarm rate) for multiple detectors, indicating improved robustness on challenging targets. The method offers a practical, label-free augmentation path for infrared small target detection in real-world scenarios.
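
As an illustration of the first of these metrics, here is a minimal sketch of pixel-level IoU between a binary prediction mask and a ground-truth mask. The function name `pixel_iou` and the convention for two empty masks are assumptions made for this example, not details taken from the paper. Pd and Fa are not sketched because this summary does not give their exact definitions.

```python
import numpy as np


def pixel_iou(pred, target):
    """Intersection over union of two binary masks (hypothetical helper)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(intersection / union)
```

For small targets that cover only a few pixels, a one-pixel error changes this ratio considerably, which is one reason IoU is usually reported together with target-level detection and false-alarm measures.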

Abstract

Recently, researchers have proposed various deep learning methods to accurately detect infrared targets, which are characterized by indistinct shape and texture. Because infrared datasets offer limited variety, training deep learning models that generalize well is challenging. To augment infrared datasets, researchers employ data augmentation techniques, which often generate new images by combining images from different datasets. However, these methods fall short in two respects. In terms of realism, images generated by mixup-based methods look unrealistic and struggle to simulate complex real-world scenarios. In terms of diversity, borrowing knowledge from another dataset inherently offers less diversity than real-world scenes. Diffusion models have recently emerged as an innovative generative approach. Diffusion models trained at large scale have a strong generative prior that enables real-world modeling of images, producing diverse and realistic results. In this paper, we propose Diff-Mosaic, a data augmentation method based on the diffusion model. The method alleviates the diversity and realism limitations of existing data augmentation methods via a diffusion prior. Specifically, it consists of two stages. In the first stage, we introduce an enhancement network called Pixel-Prior, which generates highly coordinated and realistic Mosaic images by harmonizing pixels. In the second stage, we propose an image enhancement strategy named Diff-Prior, which uses diffusion priors to model images in real-world scenes, further enhancing their diversity and realism. Extensive experiments demonstrate that our approach significantly improves the performance of the detection network. The code is available at https://github.com/YupeiLin2388/Diff-Mosaic
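
To make the starting point concrete, the Mosaic composite that the Pixel-Prior stage harmonizes can be pictured as four images tiled on a 2x2 grid. The sketch below is a deliberately simplified version under stated assumptions: grayscale inputs, equal-sized tiles, and nearest-neighbour resizing. The name `simple_mosaic` and the `tile` parameter are hypothetical, and this is not the authors' implementation.

```python
import numpy as np


def simple_mosaic(images, tile=128):
    """Tile four grayscale images on a 2x2 grid (simplified Mosaic sketch)."""
    if len(images) != 4:
        raise ValueError("expected exactly four images")
    canvas = np.zeros((2 * tile, 2 * tile), dtype=images[0].dtype)
    for idx, img in enumerate(images):
        # Nearest-neighbour resize by index sampling, to avoid extra dependencies.
        rows = np.arange(tile) * img.shape[0] // tile
        cols = np.arange(tile) * img.shape[1] // tile
        resized = img[rows][:, cols]
        # idx 0..3 maps to top-left, top-right, bottom-left, bottom-right.
        r, c = divmod(idx, 2)
        canvas[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile] = resized
    return canvas
```

Because each tile comes from a different scene, the seams between tiles show abrupt changes in grayscale level. Applying the same tiling to the ground-truth masks keeps the labels aligned with the composite.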

Paper Structure

This paper contains 21 sections, 7 equations, 10 figures, and 8 tables.

Figures (10)

  • Figure 1: We compare the traditional Mosaic (Bochkovskiy et al., 2020) with our method. To emphasize the diversity of samples generated by our method, we fix the top-right image of the Mosaic and vary the remaining images. The samples generated by the traditional Mosaic look fragmented and fail to resemble a complete image. In contrast, the samples generated by Diff-Mosaic have a uniform distribution and coordinated grayscale levels. This is especially visible for the infrared small targets marked with red circles, where Diff-Mosaic yields greater diversity and realism.
  • Figure 2: Visual results of our approach compared with different SIRST detection methods on two SIRST datasets. The target region, zoomed-in target region, false prediction region, and predictions that differ from the ground truth (GT) are annotated with red dashed circles, red boxes, yellow dashed circles, and red pixels, respectively. Unlike the baselines, which suffer from false alarms and discrepancies with the ground truth, our approach detects the targets accurately without false alarms.
  • Figure 3: Framework overview of Diff-Mosaic, showing the workflows of the training stage and the data augmentation stage. During training, the image $I_{input}$ undergoes a cut-and-paste operation to produce $I'_{mix}$. $I'_{mix}$ is then fed into the enhancement network to generate the harmonized image $I'_{smooth}$. Finally, $I'_{smooth}$ is fed into the diffusion model for training to obtain the detail-rich image $I'_{realis}$. During the data generation stage, the Mosaic operation is applied to the image $I_{input}$ to yield the Mosaic image $I_{Mosaic}$, which is input into the Pixel-Prior network to generate $I_{smooth}$. Finally, diffusion priors are used to integrate realistic and richer representations into $I_{smooth}$, generating the more visually diverse and textured image $I_{realis}$.
  • Figure 4: Degrade & Cut-and-Paste process. The image $I_{input}$ is fed into the degradation module to obtain $I'_{degrade}$. A random area $M$ is selected from $I_{input}$, cut out, and pasted onto the corresponding area of the degraded image $I'_{degrade}$ to generate the mixed image $I'_{mix}$.
  • Figure 5: Detection framework. The image $I_{realis}$ generated by Diff-Mosaic is fed into the detection network as an augmentation sample for training. The backbone network consists of multiple sets of nested attention modules, each consisting of two convolutional layers, a channel attention module, and a spatial attention module. An image passed through the densely nested attention modules yields features at different scales. Finally, these features are merged to produce the prediction result $\hat{M}$. Red circles denote regions where the target is located, and yellow circles denote regions that were incorrectly predicted. The model closes the gap between $\hat{M}$ and the ground truth $M$ through the loss $\mathcal{L}_{iou}$.
  • ...and 5 more figures
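
The Degrade & Cut-and-Paste step described for Figure 4 can be sketched as follows. The caption does not specify the degradation module, so the 3x3 box blur plus Gaussian noise used here is a stand-in assumption. The function names, the square patch shape, and the `patch` and `noise_std` parameters are likewise hypothetical. Inputs are assumed to be grayscale float images in [0, 1].

```python
import numpy as np


def degrade(image, rng, noise_std=0.05):
    """Stand-in degradation (assumption): 3x3 box blur plus Gaussian noise."""
    h, w = image.shape
    padded = np.pad(image, 1, mode="edge")
    # Average the nine shifted copies of the image to get a box blur.
    blurred = sum(
        padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)
    ) / 9.0
    noisy = blurred + rng.normal(0.0, noise_std, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)


def cut_and_paste(image, rng, patch=32):
    """Return (mixed, mask): a degraded image with one clean patch pasted back."""
    h, w = image.shape
    top = int(rng.integers(0, h - patch + 1))
    left = int(rng.integers(0, w - patch + 1))
    # mask plays the role of the random area M in the caption.
    mask = np.zeros((h, w), dtype=bool)
    mask[top:top + patch, left:left + patch] = True
    mixed = degrade(image, rng)
    # Paste the original pixels of M onto the same location of the degraded image.
    mixed[mask] = image[mask]
    return mixed, mask
```

A mixed image built this way contains a visible mismatch between the clean patch and its degraded surroundings while the original image remains available as a reference, which is the kind of pairing a harmonization network can be trained on.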