Depth on Demand: Streaming Dense Depth from a Low Frame Rate Active Sensor

Andrea Conti, Matteo Poggi, Valerio Cambareri, Stefano Mattoccia

TL;DR

Depth on Demand (DoD) targets streaming dense depth by coupling a high-FPS RGB stream with a low-FPS, sparse active depth sensor and decoupling their frame rates via $\tau = f_{\rm D}/f_{\rm RGB}$. The method deploys a three-stage pipeline (Multi-Modal Encoding, Iterative Multi-Modal Integration, and Depth Decoding), leveraging geometry cues, monocular context, and sparse depth updates through epipolar-aware features and iterative fusion to predict dense depth maps aligned to the RGB frames. Across indoor and outdoor benchmarks, DoD outperforms depth completion and traditional MVS baselines, achieving denser reconstructions with lower memory footprints and faster runtimes, and exhibits strong generalization to new datasets (e.g., Waymo). The work demonstrates practical impact for robotics and automotive perception by enabling energy-efficient, high-temporal-density depth sensing suitable for safety-critical applications, while noting moving objects as a remaining challenge and highlighting opportunities for further robustness enhancements.
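
To make the frame-rate ratio concrete, the sketch below lists which RGB frames coincide with a sparse depth capture for a given τ = f_D / f_RGB. It is an illustration only, not the authors' code: the function name `depth_frame_indices`, the assumption that both sensors start at t = 0 and are perfectly synchronized, and the example frame rates are all hypothetical.

```python
from fractions import Fraction
from typing import List


def depth_frame_indices(f_rgb: int, f_depth: int, n_frames: int) -> List[int]:
    """Return indices of RGB frames that coincide with a sparse depth capture.

    Assumptions (hypothetical, for illustration): both sensors start at t = 0,
    share a clock, and f_depth <= f_rgb.
    """
    tau = Fraction(f_depth, f_rgb)  # tau = f_D / f_RGB, kept exact
    indices = []
    for k in range(n_frames):
        # RGB frame k is captured at time k / f_rgb; a depth frame exists at
        # that instant only if k * tau is a whole number of depth frames.
        if (k * tau).denominator == 1:
            indices.append(k)
    return indices


# Example: a 30 FPS camera paired with a 5 FPS depth sensor (tau = 1/6).
print(depth_frame_indices(f_rgb=30, f_depth=5, n_frames=30))
```

Under these assumptions only one RGB frame in six carries sparse depth; the remaining frames are the ones for which depth has to be predicted from the RGB views and the most recent sparse depth frame.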

Abstract

High frame rate and accurate depth estimation plays an important role in several tasks crucial to robotics and automotive perception. To date, this can be achieved through ToF and LiDAR devices for indoor and outdoor applications, respectively. However, their applicability is limited by low frame rate, energy consumption, and spatial sparsity. Depth on Demand (DoD) allows for accurate temporal and spatial depth densification achieved by exploiting a high frame rate RGB sensor coupled with a potentially lower frame rate and sparse active depth sensor. Our proposal jointly enables lower energy consumption and denser shape reconstruction, by significantly reducing the streaming requirements on the depth sensor thanks to its three core stages: i) multi-modal encoding, ii) iterative multi-modal integration, and iii) depth decoding. We present extended evidence assessing the effectiveness of DoD on indoor and outdoor video datasets, covering both environment scanning and automotive perception use cases.

Paper Structure (25 sections, 4 equations, 13 figures, 7 tables)

Figures (13)

  • Figure 1: 3D reconstruction with a low frame rate, sparse depth sensor. Running depth completion (a) on low-FPS, sparse depth maps generates holes in the final reconstruction. Adding a higher-FPS color camera allows for obtaining depth from Multi-View Stereo (b) or projecting depth to nearby color views and running completion (c), with unsatisfactory results. Our framework (d) performs temporal completion using two views and one sparse depth frame, yielding denser and more accurate meshes.
  • Figure 2: Temporal Depth Stream Densification Setup. On the left, an example of DoD applied to an indoor video sequence where only a few frames (red views) are associated with sparse depth data. On the right, a close-up example of the assumed setup. Using an RGB-D video stream with only a few sparse depth frames requires the integration of monocular, multi-view, and sparse depth cues. Our framework smoothly enables the recovery of temporal and spatial depth resolution in such a scenario.
  • Figure 3: Depth on Demand Framework Overview. We provide a high-level overview of DoD; level-wise architectural details are provided in the supplementary material. DoD embeds multi-view cues and monocular features in the Visual Cues Integration, then integrates sparse depth updates in the Depth Cues Integration. To properly exploit both sources of information, these stages are applied iteratively in the form of depth updates.
  • Figure 4: Qualitative results on ScanNetV2. On top, from left to right: the source view with sparse depth points, the target view with projected sparse depth points, and predictions by competitors and DoD. At the bottom: meshes reconstructed by competitors and DoD, at low and high temporal resolution respectively.
  • Figure 5: Qualitative results on 7Scenes. From left to right: source view with sparse depth points, the target view with projected sparse depth points, and predictions by DoD and existing methods.
  • ...and 8 more figures
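
Several captions above refer to sparse depth points projected from the source view into the target view. A minimal sketch of that reprojection under a standard pinhole camera model follows. It is not the authors' implementation: the function name `reproject_sparse_depth`, the shared intrinsics `(fx, fy, cx, cy)`, and the convention that `R` and `t` map source-camera coordinates to target-camera coordinates are assumptions made for illustration.

```python
def reproject_sparse_depth(points, intrinsics, R, t, width, height):
    """Project sparse depth samples from a source view into a target view.

    points:     iterable of (u, v, z), pixel coordinates and depth in the source view
    intrinsics: (fx, fy, cx, cy), assumed identical for both views
    R, t:       3x3 rotation (list of rows) and translation taking source-camera
                coordinates to target-camera coordinates (assumed convention)
    Returns a dict {(u, v): depth} in the target view.
    """
    fx, fy, cx, cy = intrinsics
    projected = {}
    for u, v, z in points:
        # Back-project the pixel to a 3D point in the source camera frame.
        p = ((u - cx) / fx * z, (v - cy) / fy * z, z)
        # Apply the rigid transform into the target camera frame.
        q = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
        if q[2] <= 0:
            continue  # point lies behind the target camera
        # Project onto the target image plane and snap to the nearest pixel.
        u2 = round(fx * q[0] / q[2] + cx)
        v2 = round(fy * q[1] / q[2] + cy)
        if 0 <= u2 < width and 0 <= v2 < height:
            # If two points land on the same pixel, keep the closer one.
            if (u2, v2) not in projected or q[2] < projected[(u2, v2)]:
                projected[(u2, v2)] = q[2]
    return projected


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
sparse = [(10, 20, 2.0)]
print(reproject_sparse_depth(sparse, (100, 100, 50, 50), identity, (0.1, 0, 0), 100, 100))
```

This simple projection carries no occlusion reasoning beyond the nearest-point rule and assumes a static scene, so points on moving objects end up at the wrong target pixels.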