Event6D: Event-based Novel Object 6D Pose Tracking

Jae-Young Kang, Hoonhee Cho, Taeyeop Lee, Minjun Kang, Bowen Wen, Youngho Kim, Kuk-Jin Yoon

Abstract

Event cameras provide microsecond latency, making them suitable for 6D object pose tracking in fast, dynamic scenes where conventional RGB and depth pipelines suffer from motion blur and large pixel displacements. We introduce EventTrack6D, an event-depth tracking framework that generalizes to novel objects without object-specific training by reconstructing both intensity and depth at arbitrary timestamps between depth frames. Conditioned on the most recent depth measurement, our dual reconstruction recovers dense photometric and geometric cues from sparse event streams. EventTrack6D operates at over 120 FPS and maintains temporal consistency under rapid motion. To support training and evaluation, we introduce a comprehensive benchmark suite: a large-scale synthetic dataset for training and two complementary evaluation sets comprising real and simulated event data. Trained exclusively on synthetic data, EventTrack6D generalizes effectively to real-world scenarios without fine-tuning, maintaining accurate tracking across diverse objects and motion patterns. Our method and datasets validate the effectiveness of event cameras for 6D pose tracking of novel objects. Code and datasets are publicly available at https://chohoonhee.github.io/Event6D.
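To make the timing claims concrete, the sketch below illustrates one plausible high-rate tracking loop: depth frames arrive at a fixed, low rate, and the event stream is sliced at intermediate timestamps so that a pose can be emitted between depth frames. This is a minimal illustration under stated assumptions, not the paper's implementation: the 30 Hz depth rate, the time-sorted structured event array, and the `track_step` callable (standing in for the reconstruction-plus-refinement step sketched after the figure list below) are all hypothetical.

```python
import numpy as np

# Assumed rates (hypothetical): a 30 Hz depth sensor and a 120 Hz pose output,
# so each depth interval yields four pose updates.
DEPTH_HZ, POSE_HZ = 30.0, 120.0
STEPS_PER_DEPTH = int(POSE_HZ / DEPTH_HZ)

def slice_events(events, t_start, t_end):
    """Select events with timestamps in [t_start, t_end).

    `events` is assumed to be a time-sorted structured array with a 't' field
    (seconds), as typical event-camera drivers provide."""
    lo = np.searchsorted(events["t"], t_start, side="left")
    hi = np.searchsorted(events["t"], t_end, side="left")
    return events[lo:hi]

def track(depth_frames, depth_times, events, pose_init, track_step):
    """Emit a pose at every intermediate timestamp between depth frames."""
    pose, t_prev = pose_init, depth_times[0]
    poses = [(t_prev, pose)]
    for k in range(len(depth_frames) - 1):
        t0, t1 = depth_times[k], depth_times[k + 1]
        for s in range(1, STEPS_PER_DEPTH + 1):
            # Relative time tau = s / STEPS_PER_DEPTH, i.e. 0.25, 0.5, 0.75, 1.0.
            t = t0 + s * (t1 - t0) / STEPS_PER_DEPTH
            E_0t = slice_events(events, t0, t)      # events since the last depth frame
            E_dt = slice_events(events, t_prev, t)  # events since the last pose update
            pose = track_step(depth_frames[k], E_0t, E_dt, pose)
            poses.append((t, pose))
            t_prev = t
    return poses
```

With these assumed rates, each depth frame yields pose updates at the relative times $\tau = 0.25, 0.5, 0.75, 1.0$, matching the intermediate timestamps shown in Figure 4.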

Paper Structure

This paper contains 36 sections, 22 equations, 14 figures, and 10 tables.

Figures (14)

  • Figure 1: Conventional RGB-D based methods often fail in highly dynamic scenes due to the limited frame rates of common RGB-D cameras. Our EventTrack6D addresses this issue by reconstructing dual modalities, image and depth, between consecutive depth frames to bridge the gap with event data. This enables inference at finer temporal intervals and yields robust tracking under highly dynamic motion.
  • Figure 2: Overview of our EventTrack6D. EventTrack6D consists of a dual-modal reconstruction module and a pose refinement module. It performs 6D pose tracking over the high-frequency event stream despite the limited frame rate of depth images, which leaves depth missing over the interval between $\tau=0$ and $\tau=t$. To achieve this, the dual-modal reconstruction module takes as input the most recent depth frame $D_0$, the event stream $E_{0,t}$ accumulated from that frame to the current time $t$, at which no depth frame is available, and the event stream $E_{t-\Delta t,t}$ from the most recent dual-modal reconstruction to the current time. From these inputs, it reconstructs the current intensity image $I_t$ and depth $D_t$. These reconstructed modalities are then used in the pose refinement module to estimate the 6D pose transformation from time $t\!-\!\Delta t$ to $t$ (see the code sketch after this figure list).
  • Figure 3: System designed for acquiring the Event6D dataset. The event camera, RGB-D camera, and motion capture system are hardware-triggered, temporally synchronized, and calibrated.
  • Figure 4: Qualitative comparison of 6D object tracking at 120 FPS on the Event6D dataset. The original FoundationPose (FP) [wen2023foundationpose] assumes RGB-D input and thus cannot be applied in a high-frame-rate setting. Note that for $\tau = 0.25, 0.5, 0.75$, our method uses its reconstructed depth rather than sensor-captured depth.
  • Figure 5: Qualitative depth-reconstruction results on depth-absent intervals. The future depth $D_1$ is provided solely for reference and is not used by the method. Despite dynamic motion, our approach reconstructs depth images that preserve coherent object structure and align with the object motion, providing geometric guidance for downstream pose tracking.
  • ...and 9 more figures
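Expanding the `track_step` placeholder from the earlier loop sketch, the snippet below mirrors the data flow described in the Figure 2 caption: the dual-modal reconstruction module maps $(D_0, E_{0,t}, E_{t-\Delta t,t})$ to the intensity image $I_t$ and depth $D_t$, and the pose refinement module estimates the relative transform from $t-\Delta t$ to $t$. Here `recon_net` and `refine_net` are hypothetical stand-ins; their actual architectures and input formats are defined in the paper, not here.

```python
import torch

def make_track_step(recon_net, refine_net):
    """Bind the two (hypothetical) modules into a `track_step` callable.

    `recon_net` is assumed to accept the most recent depth frame plus the two
    event slices (already rasterized into a network-ready representation) and
    return (I_t, D_t); `refine_net` is assumed to return a 4x4 relative
    transform."""
    @torch.no_grad()
    def track_step(D_0, E_0t, E_dt, T_prev):
        # Dual-modal reconstruction: recover intensity and depth at time t,
        # conditioned on the most recent sensor depth measurement D_0.
        I_t, D_t = recon_net(D_0, E_0t, E_dt)
        # Pose refinement: estimate the transform from t - dt to t using the
        # reconstructed modalities, then compose it with the previous pose.
        delta_T = refine_net(I_t, D_t, T_prev)  # 4x4 rigid transform
        return delta_T @ T_prev                 # object pose at time t
    return track_step
```

Because the reconstruction is conditioned only on the most recent depth frame $D_0$ and the event stream, the same step applies whether or not a sensor depth frame is available at $t$, which is what allows the loop above to run at a higher rate than the depth sensor.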