DepthLab: From Partial to Complete

Zhiheng Liu, Ka Leong Cheng, Qiuyu Wang, Shuzhe Wang, Hao Ouyang, Bin Tan, Kai Zhu, Yujun Shen, Qifeng Chen, Ping Luo

TL;DR

DepthLab addresses the challenge of missing depth information by introducing an RGB-conditioned diffusion-based depth inpainting model. It employs a dual-branch architecture with a Reference U-Net for RGB guidance and an Estimation U-Net for depth completion, using layer-by-layer cross-attention to fuse information and preserve the known depth scale. Trained on synthetic RGB-D data, DepthLab generalizes to diverse real-world tasks, enabling 3D scene inpainting, text-to-3D generation, refined sparse-view reconstruction with DUST3R, and LiDAR depth completion, while achieving superior numerical and visual quality against discriminative and diffusion-based baselines. The work demonstrates robust performance across zero-shot benchmarks and multiple applications, highlighting DepthLab as a potential foundation model for depth-related tasks and future extensions to faster sampling and normal estimation.

Abstract

Missing values remain a common challenge for depth data across its wide range of applications, stemming from various causes like incomplete data acquisition and perspective alteration. This work bridges this gap with DepthLab, a foundation depth inpainting model powered by image diffusion priors. Our model features two notable strengths: (1) it demonstrates resilience to depth-deficient regions, providing reliable completion for both continuous areas and isolated points, and (2) it faithfully preserves scale consistency with the conditioned known depth when filling in missing values. Drawing on these advantages, our approach proves its worth in various downstream tasks, including 3D scene inpainting, text-to-3D scene generation, sparse-view reconstruction with DUST3R, and LiDAR depth completion, exceeding current solutions in both numerical performance and visual quality. Our project page with source code is available at https://johanan528.github.io/depthlab_web/.

Paper Structure

This paper contains 14 sections, 1 equation, 5 figures, 3 tables.

Figures (5)

  • Figure 1: The training process of DepthLab. First, we apply random masking to the ground truth depth to create the masked depth, followed by interpolation. Both the interpolated masked depth and the original depth undergo random scale normalization before being fed into the encoder. The Reference U-Net extracts RGB features, while the Estimation U-Net takes the noisy depth, masked depth, and encoded mask as input. Layer-by-layer feature fusion allows for finer-grained visual guidance, achieving high-quality depth predictions even in large or complex masked regions.
  • Figure 2: Qualitative comparison of various methods on different datasets. In the second column, black represents the known regions, while white indicates the predicted areas. Notably, to emphasize the contrast, we reattach the known ground truth depth to the corresponding positions in the right-side visualizations of the depth maps. Other methods exhibit significant geometric inconsistency.
  • Figure 3: Visualization of Gaussian inpainting. By projecting depth directly into three-dimensional space as initial points, natural 3D consistency is maintained, enabling texture editing and object addition. Please zoom in to view more details.
  • Figure 4: Visualization of 3D scene generation. Left: Depth comparison. "Align" represents the least-squares method and shows clear geometric inconsistencies at boundaries. While LucidDreamer reduces these inconsistencies, it compromises the accuracy of the newly estimated depth. In contrast, our model produces consistent and accurate depth. Right: The improved depth estimation from our model leads to superior 3D scene generation results.
  • Figure 5: Visualization of sparse-view reconstruction with DUST3R. Left: Compared to InstantSplat, which directly uses the point cloud from DUST3R as initialization, our method produces sharper and clearer depth in non-matching regions. Right: Using our method for improved initialization results in higher-quality Gaussian splatting rendering. Please zoom in for details.
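
The "Align" baseline in Figure 4 is described only as a least-squares method. A minimal sketch of one common reading follows, assuming it fits a global scale and shift that map a separately estimated depth onto the known depth, then fills the missing pixels with the aligned estimate. The function names (`align_scale_shift`, `complete_depth`) and the flattened-list representation are illustrative assumptions, not taken from the paper or its code.

```python
def align_scale_shift(pred, target, known):
    """Fit scale s and shift t minimizing sum((s * pred + t - target)^2)
    over the pixels flagged as known (closed-form 1D least squares)."""
    xs = [p for p, k in zip(pred, known) if k]
    ys = [d for d, k in zip(target, known) if k]
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two known depth values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0.0:
        raise ValueError("predicted depth is constant over the known region")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s = cov_xy / var_x
    t = mean_y - s * mean_x
    return s, t


def complete_depth(pred, target, known):
    """Keep known depth values; fill the rest with the aligned prediction."""
    s, t = align_scale_shift(pred, target, known)
    return [d if k else s * p + t for p, d, k in zip(pred, target, known)]


# Hypothetical toy data: four pixels, the last two have no known depth.
pred = [1.0, 2.0, 3.0, 4.0]          # depth estimated in an arbitrary scale
target = [3.0, 5.0, 0.0, 0.0]        # partial depth (zeros are placeholders)
known = [True, True, False, False]   # mask of valid target pixels
completed = complete_depth(pred, target, known)
```

Because a single scale and shift are fitted for the whole image, any local disagreement between the estimate and the known depth remains as a jump where filled and known regions meet, which is consistent with the boundary inconsistencies the caption attributes to "Align".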