Deep Cost Ray Fusion for Sparse Depth Video Completion
Jungeon Kim, Soongjin Kim, Jaesik Park, Seungyong Lee
TL;DR
This work tackles sparse depth video completion with RayFusion, a cost-volume fusion framework that applies ray-wise attention over depth hypotheses to fuse sequential cost volumes from RGB-D video. The method constructs a 3D cost volume per frame, fuses it with the previous frame’s volume through self- and cross-attention on ray-wise features, and regresses depth from a probability volume, followed by NLSPN refinement. Training combines an $L_1$ depth loss with cross-entropy supervision on the probability distribution over depth planes, using soft labels derived from the two planes nearest the ground-truth depth. RayFusion achieves state-of-the-art or competitive results on KITTI, VOID, and ScanNetV2 with a substantially smaller parameter footprint (≈1.15M), is robust to varying input sparsity, and generalizes across datasets. Its noted limitation is high memory demand from the 3D convolutions and attention maps.
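The two-plane soft labels and the regression of depth from a per-pixel probability distribution can be illustrated for a single pixel. The following is a minimal sketch, not the authors' implementation: it assumes linear interpolation weights between the two bracketing planes and an expectation over planes as the regression step, neither of which is specified in this summary. The function names `soft_plane_labels`, `expected_depth`, and `soft_cross_entropy` are hypothetical.

```python
import bisect
import math


def soft_plane_labels(depth, planes):
    """Spread a ground-truth depth over its two nearest hypothesis planes.

    Assumption: weights come from linear interpolation, and `planes` is
    sorted in ascending order. Depths outside the range are clamped.
    """
    labels = [0.0] * len(planes)
    if depth <= planes[0]:
        labels[0] = 1.0
        return labels
    if depth >= planes[-1]:
        labels[-1] = 1.0
        return labels
    hi = bisect.bisect_right(planes, depth)  # first plane strictly above depth
    lo = hi - 1
    w_hi = (depth - planes[lo]) / (planes[hi] - planes[lo])
    labels[lo] = 1.0 - w_hi
    labels[hi] = w_hi
    return labels


def expected_depth(probs, planes):
    """Regress depth as the expectation over planes (assumed regression rule)."""
    return sum(p * d for p, d in zip(probs, planes))


def soft_cross_entropy(probs, labels, eps=1e-12):
    """Cross-entropy between a predicted distribution and soft labels."""
    return -sum(t * math.log(p + eps) for p, t in zip(probs, labels))


planes = [1.0, 2.0, 3.0, 4.0]           # hypothetical depth planes (metres)
labels = soft_plane_labels(2.25, planes)
recovered = expected_depth(labels, planes)
```

Under these assumptions, the expectation of the soft label distribution reproduces the ground-truth depth exactly, so the cross-entropy target and the $L_1$ depth target agree with each other.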
Abstract
In this paper, we present a learning-based framework for sparse depth video completion. Given a sparse depth map and a color image at a given viewpoint, our approach constructs a cost volume over depth hypothesis planes. To fuse the sequential cost volumes of multiple viewpoints for improved depth completion, we introduce a learning-based cost volume fusion framework, RayFusion, that applies the attention mechanism to each pair of overlapping rays in adjacent cost volumes. By leveraging feature statistics accumulated over time, our framework consistently outperforms or rivals state-of-the-art approaches on diverse indoor and outdoor datasets, including the KITTI Depth Completion benchmark, the VOID Depth Completion benchmark, and the ScanNetV2 dataset, while using far fewer network parameters.
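The idea of attending between a pair of overlapping rays can be sketched for one such pair. This is an illustration under stated assumptions rather than the RayFusion architecture: each ray is taken to be a list of per-depth-hypothesis feature vectors, the learned query/key/value projections and multi-head structure of a real attention layer are omitted, a residual connection is assumed, and the step that finds which rays overlap is not shown. The name `ray_cross_attention` is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ray_cross_attention(cur_ray, prev_ray):
    """Fuse features along the current ray with those of an overlapping ray.

    cur_ray:  D_cur x C list of feature vectors (one per depth hypothesis).
    prev_ray: D_prev x C list of feature vectors from the previous volume.
    Assumption: plain scaled dot-product attention with a residual add.
    """
    channels = len(cur_ray[0])
    scale = 1.0 / math.sqrt(channels)
    fused = []
    for query in cur_ray:
        # Similarity of this depth hypothesis to every hypothesis on prev_ray.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in prev_ray]
        weights = softmax(scores)
        # Weighted sum of the previous ray's features.
        attended = [sum(w * value[c] for w, value in zip(weights, prev_ray))
                    for c in range(channels)]
        fused.append([q + a for q, a in zip(query, attended)])
    return fused


cur_ray = [[1.0, 0.0], [0.0, 1.0]]      # 2 hypotheses, 2 channels (toy values)
prev_ray = [[2.0, 2.0], [2.0, 2.0]]     # identical features on the other ray
fused_ray = ray_cross_attention(cur_ray, prev_ray)
```

Because attention is restricted to the hypotheses along two rays, the attention map for each pair has size D_cur × D_prev rather than spanning the whole volume.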
