EMoTive: Event-guided Trajectory Modeling for 3D Motion Estimation

Zengyu Wan, Wei Zhai, Yang Cao, Zhengjun Zha

TL;DR

EMoTive introduces an event-guided trajectory framework for 3D motion estimation that directly tackles depth-induced spatio-temporal inconsistencies. It fuses dense Event Voxel spatial cues with temporally precise Event Kymographs, leveraging dual spatio-temporal cost volumes and a density-aware non-uniform NURBS trajectory to model heterogeneous motion. The method yields optical flow and motion-in-depth by sampling trajectories across multiple timestamps, with a multi-view depth consistency scheme and a multi-task supervision loss. A synthetic CarlaEvent3D dataset and experiments on DSEC demonstrate superior accuracy and efficiency, while ablations confirm the benefits of temporal resolution, trajectory order, and density-guided adaptation. Overall, EMoTive provides a compact, fast, and interpretable approach to event-based 3D scene flow that excels in challenging driving scenarios and adverse conditions.

Abstract

Visual 3D motion estimation aims to infer the motion of 2D pixels in 3D space from visual cues. The key challenge is that depth variation induces spatio-temporal motion inconsistencies, which violate the local spatial or temporal smoothness assumptions of previous motion estimation frameworks. Event cameras, in contrast, offer new possibilities for 3D motion estimation through continuous, adaptive, pixel-level responses to scene changes. This paper presents EMoTive, a novel event-based framework that models spatio-temporal trajectories via event-guided non-uniform parametric curves, effectively characterizing locally heterogeneous spatio-temporal motion. Specifically, we first introduce the Event Kymograph, an event projection method that leverages a continuous temporal projection kernel and decouples spatial observations to explicitly encode fine-grained temporal evolution. For motion representation, we introduce a density-aware adaptation mechanism that fuses spatial and temporal features under event guidance, coupled with a non-uniform rational curve parameterization framework that adaptively models heterogeneous trajectories. The final 3D motion estimate is obtained through multi-temporal sampling of the parametric trajectories, yielding optical flow and depth motion fields. To facilitate evaluation, we introduce CarlaEvent3D, a multi-dynamic synthetic dataset for comprehensive validation. Extensive experiments on both this dataset and a real-world benchmark demonstrate the effectiveness of the proposed method.
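
To make the idea of sampling a non-uniform rational parametric trajectory more concrete, the following is a minimal sketch and not the authors' implementation. It evaluates a NURBS curve with the standard Cox-de Boor recursion and samples it at several timestamps. The control points, weights, and knot vector are made-up values; storing depth as a third curve coordinate and reading motion-in-depth as the ratio of end depth to start depth are assumptions for illustration that the abstract does not itself specify.

```python
import numpy as np

def bspline_basis(i, p, t, knots):
    """Cox-de Boor recursion for the i-th B-spline basis function of degree p."""
    if p == 0:
        if knots[i] <= t < knots[i + 1]:
            return 1.0
        # Close the last non-empty span so that t == knots[-1] is covered.
        if t == knots[-1] and knots[i] < knots[i + 1] == knots[-1]:
            return 1.0
        return 0.0
    left = right = 0.0
    d1 = knots[i + p] - knots[i]
    if d1 > 0:
        left = (t - knots[i]) / d1 * bspline_basis(i, p - 1, t, knots)
    d2 = knots[i + p + 1] - knots[i + 1]
    if d2 > 0:
        right = (knots[i + p + 1] - t) / d2 * bspline_basis(i + 1, p - 1, t, knots)
    return left + right

def nurbs_point(t, ctrl, weights, knots, degree):
    """Rational curve point: weighted basis blend of control points, normalized."""
    basis = np.array([bspline_basis(i, degree, t, knots) for i in range(len(ctrl))])
    wb = basis * weights
    return (wb[:, None] * ctrl).sum(axis=0) / wb.sum()

# Hypothetical trajectory of one pixel: columns are (x, y, depth); values are made up.
ctrl = np.array([[10.0, 20.0, 8.0],
                 [11.0, 20.5, 7.5],
                 [13.0, 21.5, 6.5],
                 [16.0, 23.0, 5.0]])
weights = np.array([1.0, 2.0, 0.5, 1.0])                  # non-unit weights: rational
knots = np.array([0.0, 0.0, 0.0, 0.25, 1.0, 1.0, 1.0])    # non-uniform, clamped
degree = 2

# Multi-temporal sampling of the trajectory over normalized time [0, 1].
times = np.linspace(0.0, 1.0, 5)
samples = np.stack([nurbs_point(t, ctrl, weights, knots, degree) for t in times])

flow = samples[-1, :2] - samples[0, :2]             # 2D displacement (optical flow)
motion_in_depth = samples[-1, 2] / samples[0, 2]    # assumed convention: depth ratio
```

Because the knot vector is clamped, the curve starts at the first control point and ends at the last one, so the flow in this sketch is simply the difference between their image coordinates. Moving the interior knot or changing the weights alters how the trajectory is traversed in between, which is the degree of freedom a density-aware adaptation would act on.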

Paper Structure

This paper contains 33 sections, 25 equations, 19 figures, 5 tables, 1 algorithm.

Figures (19)

  • Figure 1: (a) The dilemma faced by motion estimation algorithms between the local smoothness assumption and the real divergence due to motion in depth; (b, c) The spatio-temporal projection of events (Event Kymograph) provides a clear observation of temporal evolution, enabling heterogeneous spatio-temporal motion analysis; (d) 3D motion can be inferred by modelling spatio-temporal trajectories via event-guided non-uniform parametric curves.
  • Figure 2: Overall pipeline of EMoTive with the event-guided trajectory formation scheme. The Event Voxel and Event Kymograph are first used to construct dual spatio-temporal correlations for motion representation. A density-aware adaptation mechanism then fuses spatial and temporal features under event guidance to model trajectories adaptively. The final 3D motion estimate is obtained through multi-temporal sampling of the parametric trajectories, yielding optical flow and depth motion fields.
  • Figure 3: Projection axes of Event Voxel and Event Kymograph. The Event Kymograph captures fine temporal cues in decoupled spatial axes.
  • Figure 3: Quantitative evaluation of scene flow on the DSEC dataset. "PC" refers to the radar point cloud, and "Ev" refers to events.
  • Figure 4: Event-guided trajectory formation process. Local event density is introduced to adaptively adjust the knots and weights while spatio-temporal features are fused to update the control points.
  • ...and 14 more figures
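
The contrast between the two projections named in the first Figure 3 caption can be sketched as follows. This is an illustrative approximation, not the paper's definition: the linear (triangular) temporal kernel, the bin counts, and the function names `event_voxel` and `event_kymograph` are assumptions. The voxel grid keeps both spatial axes with a few time bins, while the kymograph collapses one spatial axis at a time and spends its budget on many fine time bins.

```python
import numpy as np

def temporal_weights(t_norm, num_bins):
    """Linear kernel: split each event between its two nearest time bins."""
    pos = t_norm * (num_bins - 1)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, num_bins - 1)
    w_hi = pos - lo
    return lo, hi, 1.0 - w_hi, w_hi

def event_voxel(x, y, t_norm, pol, num_bins, height, width):
    """Accumulate polarities into a (num_bins, height, width) grid."""
    vox = np.zeros((num_bins, height, width))
    lo, hi, w_lo, w_hi = temporal_weights(t_norm, num_bins)
    np.add.at(vox, (lo, y, x), pol * w_lo)
    np.add.at(vox, (hi, y, x), pol * w_hi)
    return vox

def event_kymograph(x, y, t_norm, pol, num_bins, height, width):
    """Project events onto the x-t and y-t planes separately."""
    kymo_x = np.zeros((num_bins, width))
    kymo_y = np.zeros((num_bins, height))
    lo, hi, w_lo, w_hi = temporal_weights(t_norm, num_bins)
    for img, coord in ((kymo_x, x), (kymo_y, y)):
        np.add.at(img, (lo, coord), pol * w_lo)
        np.add.at(img, (hi, coord), pol * w_hi)
    return kymo_x, kymo_y

# Synthetic events (made-up data): pixel coordinates, normalized time, polarity.
rng = np.random.default_rng(0)
n, H, W = 1000, 6, 8
x = rng.integers(0, W, n)
y = rng.integers(0, H, n)
t_norm = rng.random(n)
pol = rng.choice([-1.0, 1.0], n)

voxel = event_voxel(x, y, t_norm, pol, 5, H, W)                  # coarse in time
kymo_x, kymo_y = event_kymograph(x, y, t_norm, pol, 32, H, W)    # fine in time
```

With the same number of time bins, the two kymographs in this sketch equal the voxel grid summed over y and over x respectively. The kymograph therefore trades one spatial axis for temporal resolution: a 32-bin pair here stores 32 × (8 + 6) values, fewer than a 32-bin voxel grid's 32 × 6 × 8.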