E3D: Event-Based 3D Shape Reconstruction

Alexis Baudron, Zihao W. Wang, Oliver Cossairt, Aggelos K. Katsaggelos

TL;DR

This work tackles dense 3D shape reconstruction from low-power event cameras by casting the problem as multi-view silhouette reconstruction. It introduces an Event-to-Silhouette network (E2S) and E3D, a framework driven by differentiable rendering that jointly optimizes silhouettes, camera pose, and a 3D mesh, aided by a synthetic 3D-to-event data generator. On synthetic ShapeNet data and real CeleX experiments, the method demonstrates improved mesh quality and pose estimation and shows resilience to motion blur, while highlighting notable sim-to-real gaps. The approach enables edge-friendly 3D reconstruction for AR/VR and paves the way for event-based 3D sensing with silhouette priors.

Abstract

3D shape reconstruction is a primary component of augmented/virtual reality. Despite being highly advanced, existing solutions based on RGB, RGB-D and LiDAR sensors are power- and data-intensive, which introduces challenges for deployment on edge devices. We approach 3D reconstruction with an event camera, a sensor with significantly lower power, latency and data expense while enabling high dynamic range. While previous event-based 3D reconstruction methods are primarily based on stereo vision, we cast the problem as multi-view shape from silhouette using a monocular event camera. The output from a moving event camera is a sparse point set of space-time gradients, largely sketching scene/object edges and contours. We first introduce an event-to-silhouette (E2S) neural network module to transform a stack of event frames to the corresponding silhouettes, with additional neural branches for camera pose regression. Second, we introduce E3D, which employs a 3D differentiable renderer (PyTorch3D) to enforce cross-view 3D mesh consistency and fine-tune the E2S and pose networks. Lastly, we introduce a 3D-to-events simulation pipeline and apply it to publicly available object datasets to generate synthetic event/silhouette training pairs for supervised learning.
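
To make the phrase "a stack of event frames" concrete, the following is a minimal sketch of one common way to turn a raw event stream into frames: split the time span into equal bins and sum signed polarities per pixel. The abstract does not specify the authors' exact accumulation scheme, so the function name `events_to_frames`, the `(x, y, t, polarity)` tuple layout, and the equal-width binning are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical event layout: (x, y, timestamp, polarity), polarity in {-1, +1}.
Event = Tuple[int, int, float, int]


def events_to_frames(events: List[Event], width: int, height: int,
                     num_bins: int) -> List[List[List[int]]]:
    """Accumulate signed polarities into `num_bins` frames of size height x width.

    Assumption: equal-width time bins over the span of the event stream.
    """
    frames = [[[0] * width for _ in range(height)] for _ in range(num_bins)]
    if not events:
        return frames
    t_start = min(e[2] for e in events)
    t_end = max(e[2] for e in events)
    span = (t_end - t_start) or 1.0  # avoid division by zero for a single timestamp
    for x, y, t, polarity in events:
        # Clamp so that the last timestamp falls into the final bin.
        bin_index = min(int((t - t_start) / span * num_bins), num_bins - 1)
        frames[bin_index][y][x] += polarity
    return frames


demo_events = [(0, 0, 0.0, 1), (1, 0, 0.4, -1), (1, 1, 1.0, 1), (0, 0, 0.1, 1)]
demo_frames = events_to_frames(demo_events, width=2, height=2, num_bins=2)
```

In this sketch each frame is a plain nested list indexed as `frames[bin][y][x]`; a network such as E2S would typically consume the same data as a tensor with one channel per bin.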

Paper Structure

This paper contains 19 sections, 8 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: a) Rendered image of a car overlaid on a textured background. b) Event volume caused by the viewer orbiting around the car (blue: positive event, red: negative event). c) Image of our reconstructed polygon mesh.
  • Figure 2: Overlay of our event frames on top of synthetic images of objects with random textured backgrounds.
  • Figure 3: E3D architecture with mesh supervision. Our network learns to predict pose and object silhouettes from single event frames. The Event-to-Silhouette (E2S) and pose branches share the same encoder (in pink). The predicted silhouettes and poses are used to supervise our multi-view mesh optimization. The event frame is shown with a white background for increased visibility.
  • Figure 4: 3D-to-event synthesis, from trajectory generation to synthetic event frames.
  • Figure 5: Single-category results on ShapeNet. We show two views for each reconstruction. From left to right: RGB sequence, PMO videomesh2019 (2 views), Event Frame, PMO re-trained on events (2 views) and E3D (2 views).
  • ...and 3 more figures
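
The 3D-to-event synthesis summarized in the Figure 4 caption can be illustrated with the standard idealized event-camera model, in which a pixel fires an event when its log intensity changes by more than a contrast threshold between two rendered views. This is a simplified sketch and not the authors' pipeline: the function name `simulate_events`, the `threshold` and `eps` values, and the one-event-per-pixel simplification are assumptions, and practical simulators also interpolate timestamps and can emit several events per pixel.

```python
import math
from typing import List, Tuple


def simulate_events(prev_img: List[List[float]], next_img: List[List[float]],
                    threshold: float = 0.2,
                    eps: float = 1e-3) -> List[Tuple[int, int, int]]:
    """Return (x, y, polarity) for pixels whose log intensity changed by >= threshold.

    Assumptions: intensities are non-negative floats, and at most one event
    is emitted per pixel for each pair of consecutive rendered frames.
    """
    events = []
    for y, (row_prev, row_next) in enumerate(zip(prev_img, next_img)):
        for x, (a, b) in enumerate(zip(row_prev, row_next)):
            # eps keeps the logarithm finite for black pixels.
            delta = math.log(b + eps) - math.log(a + eps)
            if abs(delta) >= threshold:
                events.append((x, y, 1 if delta > 0 else -1))
    return events


frame_a = [[0.1, 0.5], [0.5, 0.9]]
frame_b = [[0.5, 0.5], [0.1, 0.9]]
demo = simulate_events(frame_a, frame_b)
```

Pixels that brighten produce positive events and pixels that darken produce negative ones, which matches the blue/red polarity convention used in the Figure 1 caption.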