Zero-Shot Depth from Defocus

Yiming Zuo, Hongyu Wen, Venkat Subramanian, Patrick Chen, Karhan Kayan, Mario Bijelic, Felix Heide, Jia Deng

Abstract

Depth from Defocus (DfD) is the task of estimating a dense metric depth map from a focus stack. Unlike previous works that overfit to a single dataset, this paper focuses on the challenging and practical setting of zero-shot generalization. We first propose a new real-world DfD benchmark, ZEDD, which contains 8.3x more scenes and significantly higher-quality images and ground-truth depth maps than previous benchmarks. We also design a novel network architecture named FOSSA, a Transformer-based architecture with novel designs tailored to the DfD task. Its key contribution is a stack attention layer with a focus distance embedding, which allows efficient information exchange across the focus stack. Finally, we develop a new training data pipeline that lets us leverage existing large-scale RGBD datasets to generate synthetic focus stacks. Experimental results on ZEDD and other benchmarks show a significant improvement over the baselines, reducing errors by up to 55.7%. The ZEDD benchmark is released at https://zedd.cs.princeton.edu. The code and checkpoints are released at https://github.com/princeton-vl/FOSSA.
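
As a concrete illustration of the stack attention layer described above, here is a minimal sketch assuming a PyTorch-style implementation. The module name, the sinusoidal focus distance embedding, the (B, S, N, C) tensor layout, and the pre-norm residual structure are illustrative assumptions rather than the released FOSSA code; the only properties taken from the abstract are that attention runs across the focus stack and that a focus distance embedding is added.

```python
# Hedged sketch of a stack attention layer with a focus distance embedding.
# Names, layout, and the sinusoidal embedding are assumptions, not the
# authors' released code.
import math
import torch
import torch.nn as nn


def focus_distance_embedding(dist, dim):
    """Sinusoidal embedding of per-image focus distances.

    dist: (B, S) focus distance of each image in the stack.
    Returns: (B, S, dim); dim is assumed even.
    """
    half = dim // 2
    freqs = torch.exp(
        -math.log(1000.0) * torch.arange(half, device=dist.device) / half
    )
    angles = dist[..., None] * freqs  # (B, S, half)
    return torch.cat([angles.sin(), angles.cos()], dim=-1)


class StackAttention(nn.Module):
    """Self-attention across the focus-stack axis, per spatial token.

    Each spatial location attends only to the same location in the other
    images of the stack, so the attention cost grows with the stack size S
    rather than with S times the number of spatial tokens.
    """

    def __init__(self, dim, heads=8):
        super().__init__()
        # dim must be divisible by heads for nn.MultiheadAttention.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, focus_dist):
        # x: (B, S, N, C) image features; focus_dist: (B, S).
        B, S, N, C = x.shape
        x = x + focus_distance_embedding(focus_dist, C)[:, :, None, :]
        # Fold spatial tokens into the batch so attention runs over S.
        t = x.permute(0, 2, 1, 3).reshape(B * N, S, C)
        n = self.norm(t)
        t = t + self.attn(n, n, n)[0]
        return t.reshape(B, N, S, C).permute(0, 2, 1, 3)
```

Folding the spatial tokens into the batch dimension is what makes the cross-stack exchange cheap: each attention call only ever sees S elements, however large the image is.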

Paper Structure

This paper contains 32 sections, 12 equations, 13 figures, 5 tables, and 1 algorithm.

Figures (13)

  • Figure 1: Qualitative comparison between ZEDD and DDFF [ddff]. DDFF uses a very small aperture, so the defocus effect is barely noticeable even when zoomed in. In contrast, our focus stacks exhibit clear, smooth defocus effects. Our depth ground truth is higher resolution and denser, and it contains no missing regions due to occlusions.
  • Figure 2: Overview of the FOSSA pipeline. FOSSA consists of two main stages: focus stack feature extraction (blue box) and feature refinement (gray box). In each focus stack feature extraction layer, the image features are first processed individually and then pass through a stack attention layer for efficient information exchange across the stack. The features are then collapsed along the stack dimension and pass through a sequence of refinement ViT blocks and finally a DPT head for dense depth output. (A hedged code sketch of this pipeline follows this list.)
  • Figure 3: A gallery of our ZEDD benchmark. ZEDD includes 100 diverse scenes spanning a wide range of indoor and outdoor environments. The dataset features rich geometric structure and provides high-quality ground-truth.
  • Figure 4: Qualitative comparisons with baselines on 3 benchmarks. Our results are both sharp and metrically accurate. While monocular depth methods such as MoGe-2 [moge2] also produce sharp results, their scales are not metrically correct due to scale ambiguity. DfD methods such as HybridDepth [hybriddepth] fail on ZEDD and Infinigen Defocus. Even on DDFF, which HybridDepth is trained on, its results are not as sharp as ours.
  • Figure 5: FOSSA is robust to the aperture size, the focus distance distribution, and the focus stack size. The variance is small, and we consistently outperform baselines under all configurations. Although always trained with a focus stack size of 5, our method can work with as few as 2 images. All results are on the ZEDD validation split.
  • ...and 8 more figures
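
For readers who prefer pseudocode, the following sketch traces the two-stage pipeline from the Figure 2 caption. The module granularity, the mean-based stack collapse, and all constructor arguments are assumptions for illustration; only the overall flow (per-image processing, stack attention, collapse along the stack dimension, ViT refinement, DPT head) comes from the caption.

```python
# Hedged pseudocode of the FOSSA forward pass per the Figure 2 caption.
# Module choices (e.g., mean pooling as the stack collapse) are assumptions.
import torch
import torch.nn as nn


class FOSSA(nn.Module):
    def __init__(self, extract_layers, stack_attns, refine_blocks, dpt_head):
        super().__init__()
        self.extract_layers = nn.ModuleList(extract_layers)  # per-image ViT blocks
        self.stack_attns = nn.ModuleList(stack_attns)        # one per extraction layer
        self.refine_blocks = nn.ModuleList(refine_blocks)    # refinement ViT blocks
        self.dpt_head = dpt_head                             # dense prediction head

    def forward(self, stack, focus_dist):
        # stack: (B, S, N, C) tokenized focus-stack features;
        # focus_dist: (B, S) focus distance of each image in the stack.
        x = stack
        for layer, stack_attn in zip(self.extract_layers, self.stack_attns):
            B, S, N, C = x.shape
            # Process each image in the stack independently ...
            x = layer(x.reshape(B * S, N, C)).reshape(B, S, N, C)
            # ... then exchange information across the stack.
            x = stack_attn(x, focus_dist)
        # Collapse the stack dimension (mean pooling is an assumption).
        x = x.mean(dim=1)  # (B, N, C)
        for block in self.refine_blocks:
            x = block(x)
        return self.dpt_head(x)  # dense metric depth map
```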