Pix2Next: Leveraging Vision Foundation Models for RGB to NIR Image Translation

Youngwan Jin, Incheol Park, Hanbin Song, Hyeongjin Ju, Yagiz Nalcakan, Shiho Kim

TL;DR

Pix2Next tackles the challenge of generating high-quality NIR images from RGB inputs under limited NIR data by integrating a Vision Foundation Model as a global feature extractor with cross-attention in an encoder-decoder generator. A multi-scale PatchGAN discriminator and a combined loss (GAN, SSIM, and feature matching) drive spectral-preserving translations, achieving state-of-the-art results on RANUS and IDD-AW and demonstrating practical benefits by augmenting NIR data for downstream object detection. The approach also shows strong LWIR translation performance on the FLIR dataset, indicating broader multispectral translation potential. These contributions enable scalable, high-fidelity NIR data generation, with implications for robust NIR-based computer vision in challenging conditions.

Abstract

This paper proposes Pix2Next, a novel image-to-image translation framework designed to address the challenge of generating high-quality Near-Infrared (NIR) images from RGB inputs. Our approach leverages a state-of-the-art Vision Foundation Model (VFM) within an encoder-decoder architecture, incorporating cross-attention mechanisms to enhance feature integration. This design captures detailed global representations and preserves essential spectral characteristics, treating RGB-to-NIR translation as more than a simple domain transfer problem. A multi-scale PatchGAN discriminator ensures realistic image generation at various detail levels, while carefully designed loss functions couple global context understanding with local feature preservation. We performed experiments on the RANUS dataset to demonstrate Pix2Next's advantages in quantitative metrics and visual quality, improving the FID score by 34.81% compared to existing methods. Furthermore, we demonstrate the practical utility of Pix2Next by showing improved performance on a downstream object detection task using generated NIR data to augment limited real NIR datasets. The proposed approach enables the scaling up of NIR datasets without additional data acquisition or annotation efforts, potentially accelerating advancements in NIR-based computer vision applications.

Paper Structure (27 sections, 7 equations, 13 figures, 8 tables, 1 algorithm)

Figures (13)

  • Figure 1: The top row (a, c, e) presents outputs from the RGB camera, while the bottom row (b, d, f) displays the corresponding NIR images. Objects that are not clearly discernible in the RGB images, such as the house in (b), the pedestrian in (d), and the car in (f), are distinctly visible in the NIR domain. (infinity)
  • Figure 2: Comparison and distribution of publicly available autonomous driving-based RGB vs NIR datasets
  • Figure 3: Overall architecture of the Pix2Next method. The Generator and Discriminator architectures are primarily based on the Pix2pixHD framework. However, to achieve fine-grained scene representation, we integrated an Extractor module with cross-attention mechanisms applied to various layers of the Generator.
  • Figure 4: Example of RGB to NIR generation using the proposed method
  • Figure 5: Diagram of the electromagnetic spectrum focusing on the infrared range
  • ...and 8 more figures
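
The cross-attention step named in the abstract and in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain scaled dot-product attention with no learned projections, and the names `cross_attention`, `generator_feats`, and `extractor_feats` are hypothetical. Queries stand in for Generator features and keys/values stand in for Extractor (VFM) features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (illustrative, no learned weights).

    queries: one feature vector per generator location (hypothetical input)
    keys, values: one feature vector per extractor token (hypothetical input)
    Returns one attended vector per query.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this generator location to every extractor token.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the extractor values.
        attended = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(attended)
    return outputs


# Toy data: three generator locations, two extractor tokens, dimension 2.
generator_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
extractor_feats = [[2.0, 0.0], [0.0, 2.0]]
fused = cross_attention(generator_feats, extractor_feats, extractor_feats)
```

Each row of `fused` is a convex combination of the extractor vectors, weighted by how closely they match the corresponding generator feature. A real model would add learned query, key, and value projections and would typically combine the result with the generator feature map, details this page does not specify.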