Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers

Gangwei Xu, Haotong Lin, Hongcheng Luo, Xianqi Wang, Jingfeng Yao, Lianghui Zhu, Yuechuan Pu, Cheng Chi, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Sida Peng, Xin Yang

TL;DR

Pixel-Perfect Depth tackles flying-pixel artifacts in monocular depth estimation by performing diffusion directly in pixel space, which avoids the VAE-induced edge degradation common in latent-diffusion approaches. The authors introduce Semantics-Prompted Diffusion Transformers (SP-DiT), which inject high-level semantic cues from vision foundation models, and Cascade DiT (Cas-DiT), which progressively increases the token count to keep pixel-space diffusion efficient. The method reports the best performance among published generative models across five benchmarks and outperforms all compared models in edge-aware point-cloud evaluation by producing flying-pixel-free depth maps. Limitations include temporal inconsistency across video frames and slower inference than discriminative models; future work targets video depth estimation and acceleration.
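
To make the term "flying pixels" concrete, here is a minimal sketch, not taken from the paper. It assumes a hypothetical 1D scanline with a foreground surface at 1 m and a background at 5 m, and it uses a simple box blur purely as a stand-in for lossy reconstruction (it is not a VAE). The helper name `count_flying_pixels` and the tolerance are illustrative assumptions. Any smoothing across a depth edge creates samples that lie on neither surface; when unprojected to 3D, those samples float in the gap between the two surfaces.

```python
import numpy as np

def count_flying_pixels(depth, near, far, tol=0.05):
    """Count samples whose depth matches neither the near nor the far surface."""
    depth = np.asarray(depth, dtype=float)
    off_near = np.abs(depth - near) > tol
    off_far = np.abs(depth - far) > tol
    return int(np.sum(off_near & off_far))

# Hypothetical scanline: 8 foreground samples at 1 m, 8 background samples at 5 m.
near, far = 1.0, 5.0
sharp = np.array([near] * 8 + [far] * 8)

# Stand-in for lossy reconstruction: a 5-tap box blur (an assumption, NOT a real VAE).
kernel = np.ones(5) / 5.0
padded = np.pad(sharp, 2, mode="edge")          # keep the output the same length
blurred = np.convolve(padded, kernel, mode="valid")

flying_sharp = count_flying_pixels(sharp, near, far)      # sharp edge: none
flying_blurred = count_flying_pixels(blurred, near, far)  # blurred edge: some

print("flying pixels (sharp):  ", flying_sharp)
print("flying pixels (blurred):", flying_blurred)
```

The sharp scanline has no in-between samples, while the blurred one has several clustered at the edge, which is the kind of artifact an edge-aware point-cloud evaluation is meant to expose.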

Abstract

This paper presents Pixel-Perfect Depth, a monocular depth estimation model based on pixel-space diffusion generation that produces high-quality, flying-pixel-free point clouds from estimated depth maps. Current generative depth estimation models fine-tune Stable Diffusion and achieve impressive performance. However, they require a VAE to compress depth maps into latent space, which inevitably introduces flying pixels at edges and details. Our model addresses this challenge by directly performing diffusion generation in the pixel space, avoiding VAE-induced artifacts. To overcome the high complexity associated with pixel-space generation, we introduce two novel designs: 1) Semantics-Prompted Diffusion Transformers (SP-DiT), which incorporate semantic representations from vision foundation models into DiT to prompt the diffusion process, thereby preserving global semantic consistency while enhancing fine-grained visual details; and 2) Cascade DiT Design that progressively increases the number of tokens to further enhance efficiency and accuracy. Our model achieves the best performance among all published generative models across five benchmarks, and significantly outperforms all other models in edge-aware point cloud evaluation.
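
The efficiency argument behind progressively increasing the number of tokens can be illustrated with simple arithmetic. The sketch below is an assumption-laden illustration, not the authors' implementation: the 512×512 input size, the patch sizes 16 and 8, and the helper names `num_tokens` and `attention_pairs` are all hypothetical choices, since the abstract does not state them. It relies only on two general facts: a transformer that patchifies an H×W input with patch size p sees (H/p)·(W/p) tokens, and self-attention compares every token with every other token.

```python
def num_tokens(height, width, patch):
    """Number of patch tokens for an image split into patch x patch squares."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)

def attention_pairs(tokens):
    """Token-to-token comparisons in one self-attention layer (quadratic)."""
    return tokens * tokens

# Hypothetical settings, chosen only for illustration.
height, width = 512, 512
coarse_patch, fine_patch = 16, 8

coarse = num_tokens(height, width, coarse_patch)  # fewer tokens in early blocks
fine = num_tokens(height, width, fine_patch)      # more tokens in later blocks
cost_ratio = attention_pairs(fine) // attention_pairs(coarse)

print("coarse tokens:", coarse)
print("fine tokens:  ", fine)
print("attention cost ratio (fine / coarse):", cost_ratio)
```

Under these assumed numbers, halving the patch size quadruples the token count and multiplies the pairwise attention cost by sixteen, which is why running only part of the network at the finer token resolution can save substantial compute.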

Paper Structure

This paper contains 20 sections, 8 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: We present Pixel-Perfect Depth, a monocular depth estimation model with pixel-space diffusion transformers. Compared to existing discriminative [yang2024depthv2, depthpro] and generative [ke2024marigold] models, its estimated depth maps can produce high-quality, flying-pixel-free point clouds.
  • Figure 2: Qualitative comparisons. GT(VAE) denotes the ground truth depth map after VAE reconstruction. Existing generative models [ke2024marigold] use a VAE to compress inputs into the latent space, inevitably introducing flying pixels at edges and details. In contrast, our model directly performs diffusion in pixel space, avoiding these issues. Depth maps are visualized on the point clouds.
  • Figure 3: Overview of Pixel-Perfect Depth. Given an input image, we concatenate it with noise and feed it into the proposed Cascade DiT. Meanwhile, the image is also processed by a pretrained encoder from Vision Foundation Models to extract high-level semantics, forming our Semantics-Prompted DiT. We perform diffusion generation directly in pixel space without using any VAE.
  • Figure 4: Comparison with existing depth foundation models on open-world images. Our model preserves more fine-grained details than Depth Anything v2 [yang2024depthv2] and MoGe 2 [moge2], while demonstrating significantly higher robustness compared to Depth Pro [depthpro].
  • Figure 5: Qualitative point cloud results in complex scenes. Our model produces significantly fewer flying pixels compared to other depth estimation models [ke2024marigold, yang2024depthv2, depthpro], with depth maps overlaid on the point clouds for visualization.
  • ...and 3 more figures