DenseFormer: Learning Dense Depth Map from Sparse Depth and Image via Conditional Diffusion Model

Ming Yuan, Sichao Wang, Chuang Zhang, Lei He, Qing Xu, Jianqiang Wang

TL;DR

DenseFormer reframes outdoor depth completion as a conditional diffusion problem guided by multi-scale RGB and sparse depth features. It introduces three components (Guidance Feature Extraction, Diffusion Process, and Depth Refinement) that progressively denoise a random depth distribution into a dense depth map, with conditioning provided by a feature pyramid and deformable attention fusion. On KITTI, DenseFormer achieves state-of-the-art RMSE and shows improved edge preservation and detail recovery, supported by ablation studies. The results suggest that diffusion models are a viable approach to depth completion in real-world driving scenarios.
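
For readers unfamiliar with the metric, the RMSE mentioned above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name `masked_rmse` is hypothetical, and it assumes the common convention that a ground-truth value of 0 marks a pixel without a depth measurement, so only valid pixels contribute to the error.

```python
import math


def masked_rmse(pred, gt):
    """Root mean squared error over pixels that have ground truth.

    Assumption: gt == 0 marks a pixel with no ground-truth depth.
    `pred` and `gt` are flat sequences of depth values in the same unit.
    """
    sq_errors = [(p - g) ** 2 for p, g in zip(pred, gt) if g > 0]
    if not sq_errors:
        raise ValueError("no valid ground-truth pixels")
    return math.sqrt(sum(sq_errors) / len(sq_errors))


# The second pixel has no ground truth, so it is ignored.
score = masked_rmse([1.0, 2.0, 5.0], [1.0, 0.0, 2.0])
```

Because the error is squared before averaging, a few large mistakes (for example at object boundaries) dominate the score, which is why edge preservation matters for this metric.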

Abstract

The depth completion task is a critical problem in autonomous driving, involving the generation of dense depth maps from sparse depth maps and RGB images. Most existing methods employ a spatial propagation network to iteratively refine the depth map after obtaining an initial dense depth. In this paper, we propose DenseFormer, a novel method that integrates the diffusion model into the depth completion task. By incorporating the denoising mechanism of the diffusion model, DenseFormer generates the dense depth map by progressively refining an initial random depth distribution through multiple iterations. We propose a feature extraction module that leverages a feature pyramid structure, along with multi-layer deformable attention, to effectively extract and integrate features from sparse depth maps and RGB images, which serve as the guiding condition for the diffusion process. Additionally, this paper presents a depth refinement module that applies multi-step iterative refinement across various ranges to the dense depth results generated by the diffusion process. The module utilizes image features enriched with multi-scale information and sparse depth input to further enhance the accuracy of the predicted depth map. Extensive experiments on the KITTI outdoor scene dataset demonstrate that DenseFormer outperforms classical depth completion methods.
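
The progressive refinement described in the abstract can be illustrated with a toy conditional sampling loop. This is a sketch under stated assumptions, not the DenseFormer implementation: the abstract does not specify the noise schedule or sampler, so the code assumes a linear beta schedule and a deterministic DDIM-style update. `toy_denoiser` is a hypothetical stand-in for the trained, feature-guided denoising network, and the `guidance` list stands in for the conditioning derived from RGB and sparse depth features.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def toy_denoiser(noisy_depth, alpha_bar, guidance):
    """Hypothetical stand-in for a trained network.

    Returns the noise that would explain `noisy_depth` if the clean depth
    were exactly `guidance`. A real model would predict this from features.
    """
    signal = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [(x - signal * g) / noise_scale for x, g in zip(noisy_depth, guidance)]


def sample_depth(guidance, num_steps=100, seed=0):
    """Denoise random values into a depth estimate, conditioned on `guidance`."""
    rng = random.Random(seed)
    alpha_bars = make_alpha_bars(num_steps)
    # Start from a random depth distribution.
    depth = [rng.gauss(0.0, 1.0) for _ in guidance]
    for t in reversed(range(num_steps)):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[t - 1] if t > 0 else 1.0
        eps = toy_denoiser(depth, a_t, guidance)
        # Estimate the clean depth implied by the predicted noise.
        clean = [(x - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t)
                 for x, e in zip(depth, eps)]
        # Deterministic DDIM-style step to the previous noise level.
        depth = [math.sqrt(a_prev) * c + math.sqrt(1.0 - a_prev) * e
                 for c, e in zip(clean, eps)]
    return depth


# Normalized depths for three pixels act as the conditioning signal here.
estimate = sample_depth([0.1, 0.5, 0.9])
```

Because the stand-in denoiser is an oracle, the loop recovers the guidance values almost exactly. With a learned network the per-step predictions are imperfect, which is the motivation the abstract gives for a separate depth refinement stage after the diffusion process.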

Paper Structure

This paper contains 14 sections, 11 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Illustration of the conditional depth denoising process.
  • Figure 2: Overall architecture of DenseFormer. A sparse depth map and an RGB image are the inputs; multi-scale features extracted from them guide the diffusion denoising process, which generates a dense depth map from a random depth distribution. The depth refinement module then iteratively refines this output to produce the final dense depth map.
  • Figure 3: The Guidance Feature Extraction Module, which extracts multi-scale features from the depth and image inputs.
  • Figure 4: Illustration of the Deformable Attention Network.
  • Figure 5: The Guidance Denoising Module. A lightweight denoising network based on feature guidance.
  • ...and 2 more figures