Pix2Next: Leveraging Vision Foundation Models for RGB to NIR Image Translation
Youngwan Jin, Incheol Park, Hanbin Song, Hyeongjin Ju, Yagiz Nalcakan, Shiho Kim
TL;DR
Pix2Next tackles the challenge of generating high-quality NIR images from RGB inputs under limited NIR data by integrating a Vision Foundation Model as a global feature extractor with cross-attention in an encoder-decoder generator. A multi-scale PatchGAN discriminator and a combined loss (GAN, SSIM, and feature matching) drive spectral-preserving translations, achieving state-of-the-art results on RANUS and IDD-AW and demonstrating practical benefits by augmenting NIR data for downstream object detection. The approach also shows strong LWIR translation performance on the FLIR dataset, indicating broader multispectral translation potential. These contributions enable scalable, high-fidelity NIR data generation, with implications for robust NIR-based computer vision in challenging conditions.
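To make the combined loss concrete, the sketch below shows one way a GAN term, an SSIM term, and a feature matching term could be summed into a single generator objective. It is an illustration under stated assumptions, not the authors' implementation: the weights `w_gan`, `w_ssim`, and `w_fm` and their default values are hypothetical, SSIM is computed over a single global window rather than the usual sliding window, feature matching is taken as a mean L1 distance between discriminator features, and the adversarial term is passed in as a precomputed scalar because the summary does not state which GAN objective is used. It is written in plain Python so it runs without dependencies.

```python
def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM between two equally sized flat lists of pixel values.

    Simplification: real SSIM losses average this statistic over local windows.
    """
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Standard SSIM stabilising constants.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def feature_matching_loss(real_feats, fake_feats):
    """Mean L1 distance between feature lists, averaged over layers.

    Each argument is a list of layers; each layer is a flat list of floats.
    """
    per_layer = [
        sum(abs(r - f) for r, f in zip(real_layer, fake_layer)) / len(real_layer)
        for real_layer, fake_layer in zip(real_feats, fake_feats)
    ]
    return sum(per_layer) / len(per_layer)


def generator_loss(adv_loss, ssim_value, fm_loss, w_gan=1.0, w_ssim=10.0, w_fm=10.0):
    """Weighted sum of the three terms; the weights are hypothetical placeholders."""
    return w_gan * adv_loss + w_ssim * (1.0 - ssim_value) + w_fm * fm_loss


# Usage: a perfect reconstruction leaves only the adversarial term.
target = [0.1, 0.4, 0.8, 0.3]
total = generator_loss(
    adv_loss=0.5,
    ssim_value=global_ssim(target, target),
    fm_loss=feature_matching_loss([target], [target]),
)
```

The SSIM term enters as `1 - SSIM` so that a structurally identical output contributes zero loss, while the feature matching term penalises differences in the discriminator's intermediate responses rather than in raw pixels.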
Abstract
This paper proposes Pix2Next, a novel image-to-image translation framework designed to address the challenge of generating high-quality Near-Infrared (NIR) images from RGB inputs. Our approach leverages a state-of-the-art Vision Foundation Model (VFM) within an encoder-decoder architecture, incorporating cross-attention mechanisms to enhance feature integration. This design captures detailed global representations and preserves essential spectral characteristics, treating RGB-to-NIR translation as more than a simple domain transfer problem. A multi-scale PatchGAN discriminator ensures realistic image generation at various detail levels, while carefully designed loss functions couple global context understanding with local feature preservation. We performed experiments on the RANUS dataset to demonstrate Pix2Next's advantages in quantitative metrics and visual quality, improving the FID score by 34.81% compared to existing methods. Furthermore, we demonstrate the practical utility of Pix2Next by showing improved performance on a downstream object detection task using generated NIR data to augment limited real NIR datasets. The proposed approach enables the scaling up of NIR datasets without additional data acquisition or annotation efforts, potentially accelerating advancements in NIR-based computer vision applications.
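As a rough illustration of the cross-attention mechanism mentioned above, the sketch below lets each generator feature vector (a query) attend over a set of global feature vectors standing in for the VFM output (keys and values). This is a minimal scaled dot-product formulation under stated assumptions: the function names are hypothetical, the learned query, key, and value projections and multi-head structure of a real layer are omitted, and nothing here is taken from the authors' code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention without learned projections.

    queries: N vectors of size d (assumed: encoder-decoder features)
    keys:    M vectors of size d (assumed: VFM global features)
    values:  M vectors of size dv (assumed: VFM global features)
    Returns N vectors of size dv.
    """
    d = len(keys[0])
    dv = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys
        ]
        weights = softmax(scores)
        # Each output is a convex combination of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dv)]
        )
    return outputs


# Usage: two local feature vectors attend over three global feature vectors.
local_feats = [[1.0, 0.0], [0.0, 1.0]]
global_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(local_feats, global_feats, global_feats)
```

Because the attention weights sum to one, every fused vector is a weighted average of the global features, which is how global context can be injected at each spatial location of the generator.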
