EVI-SAM: Robust, Real-time, Tightly-coupled Event-Visual-Inertial State Estimation and 3D Dense Mapping

Weipeng Guan, Peiyu Chen, Huibin Zhao, Yu Wang, Peng Lu

TL;DR

EVI-SAM addresses robust, real-time 6-DoF pose tracking and dense 3D mapping with a monocular event camera by fusing events, images, and IMU in a tightly-coupled hybrid framework. It combines event-based 2D-2D photometric alignment with feature-based reprojection constraints in a sliding-window optimization, and introduces an image-guided dense mapping pipeline that reconstructs dense depth and texture via region-growing inpainting and TSDF fusion. The work claims to be the first non-learning approach for monocular event-based dense mapping and demonstrates strong tracking and mapping performance across HDR and aggressive-motion scenarios, including onboard handheld evaluation. The results indicate substantial improvements in robustness and density over existing event-based and image-based baselines, with practical implications for real-time navigation and obstacle avoidance in challenging environments.

Abstract

Event cameras are bio-inspired, motion-activated sensors that demonstrate substantial potential in handling challenging situations, such as motion blur and high dynamic range. In this paper, we propose EVI-SAM to tackle the problem of 6-DoF pose tracking and 3D reconstruction using a monocular event camera. A novel event-based hybrid tracking framework is designed to estimate the pose, leveraging the robustness of feature matching and the precision of direct alignment. Specifically, we develop an event-based 2D-2D alignment to construct the photometric constraint, and tightly integrate it with the event-based reprojection constraint. The mapping module recovers the dense and colorful depth of the scene through the image-guided event-based mapping method. Subsequently, the appearance, texture, and surface mesh of the 3D scene can be reconstructed by fusing the dense depth maps from multiple viewpoints using truncated signed distance function (TSDF) fusion. To the best of our knowledge, this is the first non-learning work to realize event-based dense mapping. Numerical evaluations are performed on both publicly available and self-collected datasets, which qualitatively and quantitatively demonstrate the superior performance of our method. Our EVI-SAM effectively balances accuracy and robustness while maintaining computational efficiency, showcasing superior pose tracking and dense mapping performance in challenging scenarios. Video Demo: https://youtu.be/Nn40U4e5Si8.
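
The TSDF fusion step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it is the generic running-average TSDF update on a dense voxel grid, assuming a pinhole camera and a known world-to-camera pose. The function name `integrate_depth` and all of its parameters are hypothetical names chosen for this example.

```python
import numpy as np

def integrate_depth(tsdf, weight, origin, voxel_size, depth, K, T_cw, trunc):
    """Fuse one depth map into a TSDF volume (generic sketch, hypothetical API).

    tsdf, weight : (X, Y, Z) float arrays, updated in place
    origin       : world coordinates of the centre of voxel (0, 0, 0)
    depth        : (H, W) depth map in metres, 0 where depth is unknown
    K            : 3x3 pinhole intrinsics
    T_cw         : 4x4 transform from world frame to camera frame
    trunc        : truncation distance in metres
    """
    H, W = depth.shape
    # Integer index and world position of every voxel centre.
    grids = np.meshgrid(*[np.arange(n) for n in tsdf.shape], indexing="ij")
    idx = np.stack(grids, axis=-1).reshape(-1, 3)
    pts_w = np.asarray(origin, dtype=float) + voxel_size * idx
    # Move voxel centres into the camera frame and project them.
    pts_c = pts_w @ T_cw[:3, :3].T + T_cw[:3, 3]
    z = pts_c[:, 2]
    valid = z > 1e-6
    safe_z = np.where(valid, z, 1.0)  # avoid division by zero behind the camera
    u = np.round(K[0, 0] * pts_c[:, 0] / safe_z + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * pts_c[:, 1] / safe_z + K[1, 2]).astype(int)
    valid &= (u >= 0) & (u < W) & (v >= 0) & (v < H)
    # Look up the measured depth along each voxel's viewing ray.
    d = np.zeros_like(z)
    d[valid] = depth[v[valid], u[valid]]
    # Signed distance: positive in front of the surface, negative behind it.
    sdf = d - z
    # Skip unknown depth and voxels far behind the observed surface.
    valid &= (d > 0) & (sdf >= -trunc)
    new = np.clip(sdf / trunc, -1.0, 1.0)
    # Running weighted average, with weight 1 per observation.
    i, j, k = idx[valid].T
    w_old = weight[i, j, k]
    tsdf[i, j, k] = (w_old * tsdf[i, j, k] + new[valid]) / (w_old + 1.0)
    weight[i, j, k] = w_old + 1.0
```

Calling the function once per dense depth map, each with its estimated pose, accumulates the volume; a surface mesh can then be extracted at the zero crossing of the fused field. A sparse or hashed voxel structure would normally replace the dense grid used here for brevity.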

Paper Structure

This paper contains 45 sections, 25 equations, 17 figures, and 7 tables.

Figures (17)

  • Figure 1: System overview. The EVI-SAM algorithm takes events, images, and IMU data as inputs, enabling the recovery of both the camera pose and a dense map of the scene. The mapping process takes raw event streams as input, uses images for guidance, and produces a dense, textured 3D map as output. The tracking thread takes events, images, and IMU data as input, and constructs feature-based and direct constraints to estimate the 6-DoF pose.
  • Figure 2: Direct event-based alignment. (a) Event-based 2D-2D alignment: The 2D event mat in the current timestamp (a1) is warped to the 2D event mat in the previous timestamp (a2). The result (a3) is the alignment between the 2D current event mat (white) and the previous 2D event mat (red). (b) Event-based 2D-3D alignment: The current 2D event mat aggregated through a small number of events (b1) is warped to the projected mat recovered from the event-based 3D semi-dense depth in (b2). The result (b3) is a good alignment between the 2D current event mat (white) and the projected 3D event-based map (color).
  • Figure 3: The model of event-based back-projection and space-sweep calculations across different depth planes of the DSI.
  • Figure 4: The model of our event-based dense mapping incorporates edges derived from the intensity image as guidance. The upper layer represents the event-based semi-dense depth. This layer includes areas where the depth is known (regions successfully recovered through semi-dense mapping, marked in red) and areas with unknown depth (marked in black). The lower layer represents the intensity image with boundary information after segmentation. Since events are triggered in regions with edges, the semi-dense depth and the intensity image edges at the corresponding locations are consistent.
  • Figure 5: The event-based semi-dense and dense mapping of our EVI-SAM. (a) The raw event stream and (g) the intensity image from the event camera; (b) The disparity space image (DSI) of the reference view (RV) point; (c) The purely event-based semi-dense depth generated from our EVI-SAM; (d) The point cloud of the event-based semi-dense depth; (e) The occupied node of the semi-dense mapping after TSDF fusion; (f) The surface mesh of the semi-dense mapping; (h) The segmentation on the image; (i) The event-based dense depth generated from our EVI-SAM; (j) The point cloud of the event-based dense depth with texture information; (k) The occupied node of the dense mapping after TSDF fusion; (l) The surface mesh of the dense mapping.
  • ...and 12 more figures
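
The back-projection and space-sweep idea named in the Figure 3 caption can be sketched as follows. This is a simplified illustration, not the paper's code: it assumes a pinhole model with shared intrinsics, one known pose per batch of events, and fronto-parallel depth planes in the reference view (RV). The function name `sweep_events_into_dsi` and its parameters are hypothetical.

```python
import numpy as np

def sweep_events_into_dsi(dsi, events_uv, K, T_ref_cam, depths):
    """Vote back-projected event rays into a DSI (simplified sketch).

    dsi       : (D, H, W) integer ray-count volume, updated in place
    events_uv : (N, 2) event pixel coordinates (u, v) in the event camera
    K         : 3x3 pinhole intrinsics (assumed shared with the RV)
    T_ref_cam : 4x4 transform from event-camera frame to RV frame
    depths    : (D,) depths of the sweep planes in the RV frame
    """
    _, H, W = dsi.shape
    # Back-project each event pixel to a viewing ray in the camera frame.
    pix = np.column_stack([events_uv, np.ones(len(events_uv))])
    rays_cam = pix @ np.linalg.inv(K).T
    # Express ray directions and the camera centre in the RV frame.
    R, t = T_ref_cam[:3, :3], T_ref_cam[:3, 3]
    rays_ref = rays_cam @ R.T
    rz = rays_ref[:, 2]
    nonzero = np.abs(rz) > 1e-9
    for d, z in enumerate(depths):
        # Ray parameter s where the ray meets the plane Z = z.
        s = np.where(nonzero, (z - t[2]) / np.where(nonzero, rz, 1.0), -1.0)
        ok = nonzero & (s > 0)
        pts = t + s[:, None] * rays_ref
        # Project the intersection point into the reference view.
        u = np.round(K[0, 0] * pts[:, 0] / z + K[0, 2]).astype(int)
        v = np.round(K[1, 1] * pts[:, 1] / z + K[1, 2]).astype(int)
        ok &= (u >= 0) & (u < W) & (v >= 0) & (v < H)
        # np.add.at accumulates correctly when several rays hit one cell.
        np.add.at(dsi[d], (v[ok], u[ok]), 1)
```

Rays from different viewpoints intersect near the true scene structure, so DSI cells with high ray counts indicate likely depth; picking the best plane per pixel and thresholding the count gives a semi-dense depth estimate. A real system would assign a pose to each event or small event packet rather than to a whole batch.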