Dense Depth from Event Focal Stack

Kenta Horikawa, Mariko Isogawa, Hideo Saito, Shohei Mori

TL;DR

This work tackles dense depth estimation from event streams by introducing an event focal stack generated from a lens focus sweep on an event camera. The events are voxelized into a compact stack $V \in \mathbb{R}^{W \times H \times B}$ (with $B=5$) and processed by a U-Net–style network to predict a dense inverse depth map $D_{pred}$, trained with a mean squared error loss against ground-truth depths. Synthetic data produced in Blender provides supervision, while a lens-breathing correction via homographies and real-world fine-tuning bridge the domain gap to real events. Empirical results show improvements over a depth-from-defocus baseline on both synthetic and real data, and the method demonstrates robustness in low-light scenarios, albeit with limitations on textureless regions and dynamic scenes, underscoring the need for better simulators and domain adaptation strategies.
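To make the voxelization step concrete, here is a minimal sketch; it is not the authors' implementation. It assumes events arrive as arrays of pixel coordinates, timestamps, and polarities in {-1, +1}, that the focus sweep is split into $B$ equal-duration bins, and that signed polarities are summed per pixel and bin. The function name `voxelize_events` and these binning choices are illustrative assumptions.

```python
import numpy as np

def voxelize_events(x, y, t, p, width, height, num_bins=5):
    """Accumulate signed event polarities into a (W, H, B) stack (hypothetical helper)."""
    x = np.asarray(x, dtype=np.int64)
    y = np.asarray(y, dtype=np.int64)
    t = np.asarray(t, dtype=np.float64)
    p = np.asarray(p, dtype=np.float32)  # assumed polarity encoding: -1 or +1

    voxel = np.zeros((width, height, num_bins), dtype=np.float32)
    if t.size == 0:
        return voxel

    # Normalize timestamps over the sweep; guard against a zero-length span.
    t0 = t.min()
    span = max(t.max() - t0, 1e-12)

    # Map each event to a temporal bin index in [0, num_bins - 1].
    b = np.minimum((num_bins * (t - t0) / span).astype(np.int64), num_bins - 1)

    # np.add.at accumulates correctly even when (x, y, b) indices repeat.
    np.add.at(voxel, (x, y, b), p)
    return voxel
```

With $B=5$ the result has shape `(width, height, 5)`, matching $V \in \mathbb{R}^{W \times H \times B}$. Other polarity handling, such as separate channels for positive and negative events, would change only the accumulation line.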

Abstract

We propose a method for dense depth estimation from an event stream generated when sweeping the focal plane of the driving lens attached to an event camera. In this method, a depth map is inferred from an "event focal stack" composed of the event stream using a convolutional neural network trained with synthesized event focal stacks. The synthesized event stream is created from a focal stack generated by Blender for an arbitrary 3D scene, which allows for training on scenes with diverse structures. Additionally, we explored methods to eliminate the domain gap between real and synthetic event streams. Our method demonstrates superior performance over a depth-from-defocus method in the image domain on synthetic and real datasets.

Paper Structure

This paper contains 19 sections, 3 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: The proposed framework for estimating a dense depth map from an event focal stack alone. We collected datasets in both synthetic and real-world environments (Sec. \ref{sec:dataset}). Voxelizing the events from the focus sweep into an event focal stack (Sec. \ref{sec:voxelize}) transforms the data into a format compatible with a U-Net-like CNN architecture, which then takes the stack as input (Sec. \ref{sec:network}). We aim to bridge the domain gap between synthetic and real-world data by fine-tuning the model, initially trained on the synthetic dataset, with real-world data (Sec. \ref{sec:finetune}).
  • Figure 2: Collecting real-captured data. We captured the focal sweep events with an event camera and a computer-controlled lens. To avoid the impact of lens breathing, we correct the events using homography matrices $H[k] \in \mathbb{R}^{3 \times 3}$ computed from 330 images of a circular checkerboard.
  • Figure 3: Qualitative comparison of the impact of $bin$. In the inverse depth images, distance increases as the color transitions from orange to purple. In the differential images, error increases as the color transitions from blue to red. For both ESIM and DVS-Voltmeter, $bin=5$ shows the smallest error.
  • Figure 4: Qualitative comparison of the impact of polarity integration. There appear to be no significant differences between $normal$, $ppnn$, and $pnpn$.
  • Figure 5: Qualitative comparison of events generated by ESIM and DVS-Voltmeter with real-captured events. The captured scene consists of boxes arranged to become progressively more distant from left to right. The real-captured events contain visible negative events (red dots) and noisy events, whereas the events generated by ESIM and DVS-Voltmeter contain few visible negative events and appear less noisy.
  • ...and 2 more figures
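
The breathing correction described in the Figure 2 caption can be illustrated with a short sketch. It assumes that each focus step $k$ has its own homography $H[k]$ and that the correction is applied directly to event pixel coordinates. Whether the authors warp coordinates or accumulated frames is not stated here, and `warp_events` is a hypothetical helper.

```python
import numpy as np

def warp_events(xy, H):
    """Map (N, 2) pixel coordinates through a 3x3 homography (hypothetical helper)."""
    xy = np.asarray(xy, dtype=np.float64)
    H = np.asarray(H, dtype=np.float64)

    # Lift to homogeneous coordinates: (x, y) -> (x, y, 1).
    homog = np.hstack([xy, np.ones((xy.shape[0], 1))])

    # Row-vector form of H @ v for every point.
    mapped = homog @ H.T

    # Divide by the third component to return to pixel coordinates.
    return mapped[:, :2] / mapped[:, 2:3]
```

In use, events recorded at focus step $k$ would be passed through `warp_events(xy, H[k])` before voxelization, so that the same scene point stays at the same pixel across the sweep.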