Learning Monocular Depth from Focus with Event Focal Stack

Chenxu Jiang, Mingyuan Lin, Chi Zhang, Zhenghai Wang, Lei Yu

TL;DR

This work tackles monocular depth estimation from focal cues by using an Event Focal Stack to overcome the limited sampling rate of conventional cameras. The EDFF network fuses two event-based representations, the Event Voxel Grid and the Event Depth Surface, through a Focal-Distance-guided Cross-Modal (FDCM) attention mechanism and refines predictions with a Multi-level Depth Fusion Block (MDFB) in a UNet-like encoder-decoder. Two synthetic datasets, EFS-NYUv2 and EFS-Blender, are created for training and evaluation. EDFF outperforms state-of-the-art frame-based methods on quantitative metrics (RMSE, AbsRel, delta1, delta2, delta3) and in qualitative depth fidelity, while using fewer parameters. The main limitation is that event-derived depth is sparse, so the method cannot produce dense depth maps; combining it with traditional focal stacks is suggested as future work.

Abstract

Depth from Focus estimates depth by determining the moment of maximum focus from multiple shots taken at different focal distances, i.e., the Focal Stack. However, the limited sampling rate of conventional optical cameras makes it difficult to obtain sufficient focus cues during the focal sweep. Inspired by biological vision, the event camera records intensity changes over time with extremely low latency, which provides more temporal information for determining the time of best focus. In this study, we propose the EDFF Network to estimate sparse depth from the Event Focal Stack. Specifically, we use the event voxel grid to encode intensity-change information and project the event time surface into the depth domain to preserve per-pixel focal distance information. A Focal-Distance-guided Cross-Modal Attention Module is presented to fuse these two representations. Additionally, we propose a Multi-level Depth Fusion Block that integrates the results from each level of a UNet-like architecture to produce the final output. Extensive experiments validate that our method outperforms existing state-of-the-art approaches.
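The two input representations named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the linear temporal interpolation in the voxel grid, the choice of the latest event timestamp per pixel as the time surface, and the linear focal sweep between `d_near` and `d_far` are all assumptions made for illustration.

```python
import numpy as np


def event_voxel_grid(x, y, t, p, num_bins, height, width):
    """Accumulate event polarities into a (num_bins, H, W) grid.

    Each event's polarity is split between its two nearest temporal bins
    (linear interpolation in time). This is a common voxel-grid recipe,
    assumed here rather than taken from the paper.
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float64)
    if t.size == 0:
        return grid
    span = max(float(t.max() - t.min()), 1e-9)
    # Normalise timestamps to the range [0, num_bins - 1].
    tn = (t - t.min()) / span * (num_bins - 1)
    lo = np.floor(tn).astype(int)
    hi = np.minimum(lo + 1, num_bins - 1)
    w_hi = tn - lo
    # np.add.at accumulates correctly when several events hit the same cell.
    np.add.at(grid, (lo, y, x), p * (1.0 - w_hi))
    np.add.at(grid, (hi, y, x), p * w_hi)
    return grid


def event_depth_surface(x, y, t, height, width, t_start, t_end, d_near, d_far):
    """Map each pixel's latest event timestamp to a focal distance.

    Assumes (hypothetically) that the focal distance moves linearly from
    d_near at t_start to d_far at t_end. Pixels without events stay at 0.
    """
    last_t = np.full((height, width), -np.inf)
    np.maximum.at(last_t, (y, x), t)  # latest timestamp per pixel
    fired = np.isfinite(last_t)
    surface = np.zeros((height, width), dtype=np.float64)
    frac = (last_t[fired] - t_start) / (t_end - t_start)
    surface[fired] = d_near + frac * (d_far - d_near)
    return surface


# Toy example: three events on a 2x2 sensor during a sweep from 1 m to 3 m.
x = np.array([0, 1, 1])
y = np.array([0, 0, 0])
t = np.array([0.0, 0.25, 1.0])
p = np.array([1.0, -1.0, 1.0])

voxels = event_voxel_grid(x, y, t, p, num_bins=3, height=2, width=2)
depth = event_depth_surface(x, y, t, 2, 2, t_start=0.0, t_end=1.0,
                            d_near=1.0, d_far=3.0)
```

In this sketch the voxel grid keeps the polarity and timing of intensity changes, while the depth surface stores a single focal-distance value per active pixel, which is why the resulting depth cue is sparse.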

Paper Structure (11 sections, 7 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: (a) Illustration of the thin-lens model. (b) The maximum focus may be indiscernible in frame-based focal stacks due to the low sampling rate, which can be addressed by using an event-based focal stack with high temporal resolution.
  • Figure 2: Overview of EDFF. Shallow features are extracted from each pair $\{\mathcal{E}_V,\mathcal{E}_D\}$. An FDCM attention module integrates the features from the event domain $f_V$ and the depth domain $f_D$. The integrated features are then processed by a UNet-like architecture to extract focus information and generate coarse results at multiple scales, which are distinguished by arrows of different colors. Finally, the depths output at the various levels are fused by an MDFB to produce the final predicted depth.
  • Figure 3: Qualitative comparison on the EFS-NYUv2 (top) and EFS-Blender (bottom) datasets. The warmer the color, the shallower the depth. Retrained methods are marked with *. Details are zoomed in for a better view.
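
For reference, the thin-lens model illustrated in Figure 1(a) is commonly written as below. The symbols are assumptions chosen for this note and may differ from the paper's notation: $f$ is the focal length, $u$ the object distance, $v$ the distance behind the lens at which that object is imaged sharply, $v_s$ the actual sensor distance, and $A$ the aperture diameter.

$$
\frac{1}{f} = \frac{1}{u} + \frac{1}{v}, \qquad c = A\,\frac{\lvert v_s - v \rvert}{v}
$$

Here $c$ is the diameter of the blur circle. It vanishes when $v_s = v$, which is the moment of maximum focus that Depth from Focus tries to locate during the focal sweep.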