Pixel-aligned RGB-NIR Stereo Imaging and Dataset for Robot Vision

Jinnyeong Kim, Seung-Hwan Baek

TL;DR

This work tackles robust 3D robot vision under challenging lighting by introducing a pixel-aligned RGB-NIR stereo system with an integrated LiDAR on a mobile robot. It provides real and synthetic pixel-aligned RGB-NIR stereo datasets, enabling learning-based fusion without depth-pose misalignment, and proposes two fusion pathways: an RGB-NIR image fusion method whose output can be fed directly to RGB-pretrained models, and a feature-fusion depth estimation network built on RAFT-Stereo with cross-spectral attention. The methods improve depth estimation, object detection, and structure-from-motion under varying illumination, surpassing both pixel-misaligned and single-modality baselines. This work advances practical, cross-spectral 3D perception for robotics, offering data, models, and evaluation protocols that boost robustness in real-world environments. It also opens avenues for leveraging RGB-NIR priors in generative and multi-spectral perception for autonomous systems.

Abstract

Integrating RGB and NIR stereo imaging provides complementary spectral information, potentially enhancing robotic 3D vision in challenging lighting conditions. However, existing datasets and imaging systems lack pixel-level alignment between RGB and NIR images, posing challenges for downstream vision tasks. In this paper, we introduce a robotic vision system equipped with pixel-aligned RGB-NIR stereo cameras and a LiDAR sensor mounted on a mobile robot. The system simultaneously captures pixel-aligned pairs of RGB stereo images, NIR stereo images, and temporally synchronized LiDAR points. Utilizing the mobility of the robot, we present a dataset containing continuous video frames under diverse lighting conditions. We then introduce two methods that utilize the pixel-aligned RGB-NIR images: an RGB-NIR image fusion method and a feature fusion method. The first approach enables existing RGB-pretrained vision models to directly utilize RGB-NIR information without fine-tuning. The second approach fine-tunes existing vision models to more effectively utilize RGB-NIR information. Experimental results demonstrate the effectiveness of using pixel-aligned RGB-NIR images across diverse lighting conditions.

Paper Structure

This paper contains 26 sections, 7 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: RGB-NIR stereo imaging system. (a) Stereo camera setup integrated with LiDAR and active NIR illumination. (b) Spectral sensitivity profiles of the camera sensors and the irradiance profile of the NIR illumination. (c) Diagram of the pixel-aligned RGB-NIR camera featuring a dichroic prism, where RGB light passes through the prism and NIR light is reflected, achieving spectral separation.
  • Figure 2: RGB-NIR stereo dataset. (a) Our dataset comprises stereo RGB image pairs, stereo NIR image pairs, and sparse depth point clouds, all captured from continuous video sequences. (b) A comparison of our dataset with other RGB-NIR image datasets, specifically curated for 3D vision tasks. (c) A quantitative analysis of our dataset is presented, illustrating the distribution of frames under varying lighting conditions, a histogram showing the number of frames per video sequence, and the frame exposure time distributions for RGB and NIR sensors under three distinct lighting scenarios.
  • Figure 3: RGB-NIR image fusion. We fuse the RGB image $I_{\text{RGB}}$ and the NIR image $I_{\text{NIR}}$ as a weighted sum in the brightness channel $V$ after converting the RGB image to the HSV color space. The spatially-varying weights $\alpha,\beta$ are learned to effectively fuse RGB and NIR images. The fused image can be used as input to vision models for tasks such as object detection, stereo depth estimation, and structure from motion.
  • Figure 4: RGB-NIR stereo depth estimation model. We modified RAFT-Stereo (Lipson et al., 2021) with attentional feature fusion and alternative correlation search for RGB-NIR depth estimation. We extract features from RGB and NIR images, fuse them, and build cost volumes. We estimate disparity by repeatedly feeding the cost volume of fused features and the cost volume of NIR features to the GRU unit whose hidden state is initialized with $F_\text{fusion}^\text{left}$.
  • Figure 5: Image fusion for pretrained RAFT-Stereo (Lipson et al., 2021). (a) & (b) Using a single modality, either RGB or NIR, often results in suboptimal disparity estimation in challenging lighting conditions. (c) Our RGB-NIR fused image enables robust disparity estimation without fine-tuning the pretrained model. (d) Closeups. (e) Ground-truth LiDAR sparse disparity.
  • ...and 3 more figures
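
The image fusion described in the Figure 3 caption (a weighted sum in the HSV brightness channel $V$) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `fuse_pixel` and `fuse_image` are hypothetical, the weights $\alpha,\beta$ are passed in as plain numbers rather than predicted by a learned network, the form $V_\text{fused} = \alpha V + \beta I_\text{NIR}$ is one reading of "weighted sum", and clamping the result to $[0, 1]$ is an added assumption. Only the Python standard library is used; a practical version would vectorize over whole images.

```python
import colorsys


def fuse_pixel(rgb, nir, alpha, beta):
    """Fuse one RGB pixel with one NIR intensity in the HSV brightness channel.

    rgb   : (r, g, b) floats in [0, 1]
    nir   : NIR intensity in [0, 1], pixel-aligned with rgb
    alpha : weight on the RGB brightness V (learned in the paper; given here)
    beta  : weight on the NIR intensity (learned in the paper; given here)
    """
    # Hue and saturation come from the RGB image and are left untouched.
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    # Weighted sum in the brightness channel.
    v_fused = alpha * v + beta * nir
    # Assumption: clamp to the valid range so the result is a displayable color.
    v_fused = min(max(v_fused, 0.0), 1.0)
    return colorsys.hsv_to_rgb(h, s, v_fused)


def fuse_image(rgb_img, nir_img, alpha_map, beta_map):
    """Apply fuse_pixel with spatially varying weights.

    All four arguments are nested lists (rows of pixels) of the same size,
    which is only meaningful because RGB and NIR are pixel-aligned.
    """
    return [
        [fuse_pixel(p, n, a, b)
         for p, n, a, b in zip(rgb_row, nir_row, a_row, b_row)]
        for rgb_row, nir_row, a_row, b_row
        in zip(rgb_img, nir_img, alpha_map, beta_map)
    ]


if __name__ == "__main__":
    # A dark, reddish RGB pixel whose NIR counterpart is bright.
    dark_rgb, bright_nir = (0.2, 0.1, 0.05), 0.8
    print(fuse_pixel(dark_rgb, bright_nir, alpha=0.5, beta=0.5))
```

Because the output is still a three-channel color image, it can be passed to an RGB-pretrained model unchanged, which is the property the caption highlights. With `alpha=1` and `beta=0` the sketch returns the original RGB pixel.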