Cross-spectral Gated-RGB Stereo Depth Estimation

Samuel Brucker, Stefanie Walz, Mario Bijelic, Felix Heide

TL;DR

The paper tackles high-resolution metric depth estimation at long range with cost-effective sensors, a setting in which gated depth methods have lagged behind RGB imaging in resolution. It introduces a cross-spectral stereo network that fuses multi-view RCCB (high-resolution visible HDR) and gated NIR data using a cross-spectral fusion module with PoseNet-based pose refinement and attention-based feature fusion, trained with both self-supervision and LiDAR supervision. Key contributions include the cross-spectral depth estimation framework, a cross-modal stereo network, a training scheme that leverages both modalities, and a dataset of synchronized RCCB and gated views with ground truth up to 220 m. The approach substantially reduces MAE at long ranges and enables applications such as long-distance small-object detection in autonomous driving, using economical CMOS sensors and active illumination.

Abstract

Gated cameras flood-illuminate a scene and capture its time-gated impulse response. By employing nanosecond-scale gates, existing sensors can capture megapixel gated images, delivering dense depth that improves on today's LiDAR sensors in spatial resolution and depth precision. Although gated depth estimation methods deliver a million depth estimates per frame, their resolution is still an order of magnitude below that of existing RGB imaging methods. In this work, we combine high-resolution stereo HDR RCCB cameras with gated imaging, allowing us to exploit depth cues from active gating, multi-view RGB and multi-view NIR sensing -- multi-view and gated cues across the entire spectrum. The resulting capture system consists only of low-cost CMOS sensors and flood-illumination. We propose a novel stereo-depth estimation method that is capable of exploiting these multi-modal multi-view depth cues, including the active illumination that the RCCB camera measures when its IR-cut filter is removed. The proposed method achieves accurate depth at long ranges, outperforming the next best existing method by 39% in MAE for ranges of 100 to 220 m on accumulated LiDAR ground truth. Our code, models and datasets are available at https://light.princeton.edu/gatedrccbstereo/.
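The long-range result above is reported as MAE restricted to a distance interval. A minimal sketch of such a range-binned MAE follows, assuming per-pixel predictions and sparse LiDAR ground truth given as flat sequences in metres, with `None` marking pixels that have no LiDAR return. The function name `range_binned_mae` and the inclusive bin edges are illustrative assumptions, not taken from the paper or its code release.

```python
from typing import Optional, Sequence


def range_binned_mae(
    pred: Sequence[float],
    gt: Sequence[Optional[float]],
    lo: float,
    hi: float,
) -> Optional[float]:
    """Mean absolute error over pixels whose ground-truth depth lies in [lo, hi].

    Pixels without ground truth (None) are skipped. Returns None when no
    valid pixel falls inside the bin. The bin-edge convention is an assumption.
    """
    errors = [
        abs(p - g)
        for p, g in zip(pred, gt)
        if g is not None and lo <= g <= hi
    ]
    return sum(errors) / len(errors) if errors else None


# Toy values for illustration only; these are not results from the paper.
pred_depth = [10.5, 120.0, 205.0, 150.0]
gt_depth = [10.0, 118.0, 210.0, None]

near_mae = range_binned_mae(pred_depth, gt_depth, 0.0, 100.0)
far_mae = range_binned_mae(pred_depth, gt_depth, 100.0, 220.0)
```

Selecting pixels by their ground-truth depth, rather than by predicted depth, keeps the set of evaluated pixels identical across methods, which is what makes a per-range comparison between methods meaningful.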

Paper Structure (10 sections, 12 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: RCCB cameras (top row) capture 8 Mpix passive RGB images. Gated cameras (bottom row) record time-of-flight data of a scene by combining active flash illumination and analog gated readout. The two sensors are complementary, with distinct strengths depending on the scenario. RCCB cameras excel in daylight (a) with high dynamic range, resolution and color. At night (b, c), gated images (shown here color-coded by mapping each gated slice to one RGB color) provide strong depth cues and maintain consistent scene illumination through active illumination. This work integrates both modalities to estimate depth accurately in all ambient illumination conditions.
  • Figure 2: Cross-Spectral Matching (CSM). The layer fuses encoded features from RCCB ($F^c_l$) and gated ($F^g_l$) images. In the coarse registration step, RCCB features are aligned with gated features based on the calibrated poses $X_{c \to g}$. The registration is then refined with a residual pose $\hat{X}_{c|g \to g}$, which PoseNet estimates from the coarsely aligned images and the measured time delta. The registered features are combined by attention-based fusion, which retains complementary information in $\hat{F}$.
  • Figure 3: The proposed cross-spectral stereo architecture for depth estimation from stereo RCCB and stereo gated images incorporating our CSM layer. The network can output depth for all four input images. Intermediate depth estimates are used for iterative fusion within the CSM along the depth estimation process. The network is trained with self-supervision (Left-Right consistency for RCCB and gated images, Gated Reconstruction) and LiDAR supervision.
  • Figure 4: Depth estimation for "lost cargo", small objects at far distances on ground level that may be lost from preceding vehicles. Our method estimates accurate depth for these small objects in both daylight and nighttime conditions by integrating complementary RCCB and gated images. Single-modality methods suffer from limitations: CREStereo [liPracticalStereoMatching2022] (RCCB) lacks effective illumination at night, and Gated Stereo [gatedstereo] suffers from poor resolution during the day.
  • Figure 5: The sensor setup of the test vehicle used for capturing the dataset. It features a stereo gated camera, consisting of a flood-light flash source (not visible, mounted at the front bumper of the car) and two gated imagers, together with a Velodyne VLS128 scanning LiDAR, a standard stereo RGB camera and the RCCB stereo camera.
  • ...and 2 more figures
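
The coarse registration step in the Figure 2 caption aligns RCCB content to the gated view using calibrated poses. One common way to realize such an alignment is to reproject each pixel with a depth estimate and a rigid transform. The sketch below does this for a single pixel under a pinhole camera model. It is an illustration under stated assumptions, not the authors' implementation: the function name `reproject_pixel`, the `(fx, fy, cx, cy)` intrinsics layout and all numeric values are hypothetical, and the PoseNet refinement and attention-based fusion described in the caption are not modelled here.

```python
from typing import Optional, Sequence, Tuple

Intrinsics = Tuple[float, float, float, float]  # (fx, fy, cx, cy), assumed layout


def reproject_pixel(
    u: float,
    v: float,
    depth: float,
    k_src: Intrinsics,
    k_dst: Intrinsics,
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
) -> Optional[Tuple[float, float]]:
    """Map pixel (u, v) with known depth from a source camera to a target camera.

    rotation (3x3) and translation (3,) express source-frame points in the
    target frame: p_dst = R @ p_src + t. Returns None if the point ends up
    behind the target camera.
    """
    fx, fy, cx, cy = k_src
    # Back-project the pixel to a 3D point in the source camera frame.
    p_src = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Apply the rigid transform to express the point in the target frame.
    p_dst = [
        sum(rotation[i][j] * p_src[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    if p_dst[2] <= 0.0:
        return None
    # Project the point with the target camera's intrinsics.
    fxd, fyd, cxd, cyd = k_dst
    return (fxd * p_dst[0] / p_dst[2] + cxd, fyd * p_dst[1] / p_dst[2] + cyd)


# Hypothetical calibration: identical orientation, 0.5 m lateral offset.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
offset = [0.5, 0.0, 0.0]
k_rccb = (1000.0, 1000.0, 500.0, 300.0)
k_gated = (2000.0, 2000.0, 1000.0, 600.0)

far_px = reproject_pixel(500.0, 300.0, 100.0, k_rccb, k_gated, identity, offset)
near_px = reproject_pixel(500.0, 300.0, 10.0, k_rccb, k_gated, identity, offset)
```

With these toy values the same source pixel lands at different target columns for a depth of 100 m and a depth of 10 m. The alignment therefore depends on the depth estimate, which is consistent with the Figure 3 caption's note that intermediate depth estimates are used for iterative fusion within the CSM.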