Deep Cost Ray Fusion for Sparse Depth Video Completion

Jungeon Kim, Soongjin Kim, Jaesik Park, Seungyong Lee

TL;DR

This work tackles sparse depth video completion by introducing RayFusion, a cost-volume fusion framework that performs ray-wise attention over depth hypotheses to fuse sequential cost volumes from RGB-D video. The method constructs a 3D cost volume per frame, fuses it with the previous frame’s volume through self- and cross-attention on ray-wise features, and regresses depth from a probability volume followed by NLSPN refinement. Training uses a combination of $L_1$ depth loss and cross-entropy supervision on the probability distribution over depth planes, with soft labels derived from the two nearest planes. RayFusion delivers state-of-the-art or competitive results on KITTI, VOID, and ScanNetV2 with a substantially smaller parameter footprint (≈1.15M) and demonstrates robustness to varying sparsity and cross-dataset generalization, albeit with a noted high memory demand due to 3D convolutions and attention maps.
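The supervision summarized above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the soft label splits its mass linearly between the two planes that bracket the ground-truth depth, that depth is regressed as the expectation over the plane probabilities, and that the two losses are simply summed with a hypothetical `ce_weight`. All function names are illustrative.

```python
import math

def soft_plane_labels(gt_depth, planes):
    """Spread probability mass over the two planes that bracket gt_depth.

    Assumption: linear interpolation weights, so the expected depth under
    the label equals gt_depth. The paper may weight the planes differently.
    """
    n = len(planes)
    label = [0.0] * n
    if gt_depth <= planes[0]:
        label[0] = 1.0
        return label
    if gt_depth >= planes[-1]:
        label[-1] = 1.0
        return label
    # Largest index i with planes[i] <= gt_depth (so gt_depth < planes[i + 1]).
    i = max(k for k in range(n - 1) if planes[k] <= gt_depth)
    w = (gt_depth - planes[i]) / (planes[i + 1] - planes[i])
    label[i] = 1.0 - w
    label[i + 1] = w
    return label

def softmax(logits):
    """Numerically stable softmax over one ray of plane scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_depth(prob, planes):
    """Regress depth as the probability-weighted mean of plane depths."""
    return sum(p * d for p, d in zip(prob, planes))

def cross_entropy(prob, label, eps=1e-12):
    """Cross-entropy between a predicted distribution and a soft label."""
    return -sum(t * math.log(p + eps) for p, t in zip(prob, label))

def total_loss(logits, gt_depth, planes, ce_weight=1.0):
    """L1 depth loss plus weighted cross-entropy (ce_weight is hypothetical)."""
    prob = softmax(logits)
    l1 = abs(expected_depth(prob, planes) - gt_depth)
    ce = cross_entropy(prob, soft_plane_labels(gt_depth, planes))
    return l1 + ce_weight * ce
```

For example, with planes at 1, 2, 3, and 4 m and a ground-truth depth of 2.25 m, the label places 0.75 on the 2 m plane and 0.25 on the 3 m plane, and its expected depth is again 2.25 m.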

Abstract

In this paper, we present a learning-based framework for sparse depth video completion. Given a sparse depth map and a color image at a certain viewpoint, our approach constructs a cost volume on depth hypothesis planes. To fuse the sequential cost volumes of multiple viewpoints for improved depth completion, we introduce a learning-based cost volume fusion framework, namely RayFusion, that applies the attention mechanism to each pair of overlapping rays in adjacent cost volumes. By leveraging feature statistics accumulated over time, our proposed framework consistently outperforms or rivals state-of-the-art approaches on diverse indoor and outdoor datasets, including the KITTI Depth Completion benchmark, VOID Depth Completion benchmark, and ScanNetV2 dataset, while using far fewer network parameters.

Paper Structure (14 sections, 6 equations, 7 figures, 5 tables)

This paper contains 14 sections, 6 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Depth video completion result of our RayFusion framework. The framework takes RGB and sparse depth (0.1% density) video pairs as input (left) and infers completed depth maps (middle). Additionally, we show 3D reconstructions using raw sparse depths (top right) and the completed depths (bottom right). See the supplementary video for various video depth completion results.
  • Figure 2: Illustration of the proposed cost volume fusion scheme. A cost volume is constructed on $D$ depth hypothesis planes and each voxel contains a $C$-dimensional feature vector. When fusing two aligned cost volumes ($\mathbf{V}'_{(t-1)\rightarrow t},\mathbf{V}_{t}$) (b), the proposed scheme (c) applies the attention mechanism to the feature sequences corresponding to rays. It is more computationally and memory efficient than the naive approach (d), which calculates attention over all features in the cost volumes.
  • Figure 3: Overall pipeline of our framework. For each frame, our framework infers a cost volume from a single-view RGB-D image (Section \ref{sec:cost_vol_gen}) and then fuses the cost volume with the cost volume updated up to the previous frame (Section \ref{sec:cost_fusion}). The fused cost volume is used for completed depth regression (Section \ref{sec:regress}) and becomes the cost volume for fusion at the next frame. Finally, the completed depth is refined by non-local spatial propagation networks (NLSPN).
  • Figure 4: Our ray fusion module.
  • Figure 5: Visual comparison of completed depths (top) on the ScanNetV2 test set. Error maps of completed depths are also presented (bottom).
  • ...and 2 more figures
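
The efficiency argument in the Figure 2 caption can be made concrete with a small sketch. It is an illustration under stated assumptions rather than the paper's ray fusion module: attention is single-head scaled dot-product with no learned projections, each ray is a list of $D$ feature vectors of dimension $C$, and the function names are hypothetical. The second function only counts attention-score entries for a volume with $H \times W$ rays.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def ray_cross_attention(query_ray, key_ray):
    """Cross-attention along one ray.

    query_ray: D x C features from the current cost volume.
    key_ray:   D x C features from the aligned previous cost volume,
               used as both keys and values (a simplification).
    Returns D x C fused features.
    """
    c = len(query_ray[0])
    fused = []
    for q in query_ray:
        # One score per depth hypothesis on the other ray.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(c) for k in key_ray]
        weights = softmax(scores)
        fused.append([sum(w * k[ch] for w, k in zip(weights, key_ray))
                      for ch in range(c)])
    return fused

def attention_entries(h, w, d):
    """Count attention-score entries: ray-wise versus all-pairs."""
    ray_wise = h * w * d * d      # one D x D map per ray
    naive = (h * w * d) ** 2      # every voxel attends to every voxel
    return ray_wise, naive
```

Under these assumptions the ray-wise map is smaller than the all-pairs map by a factor of $H \cdot W$: `attention_entries(2, 2, 4)` gives 64 entries against 256.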