YCB-Ev 1.1: Event-vision dataset for 6DoF object pose estimation

Pavel Rojtberg, Thomas Pöllabauer

TL;DR

The paper presents YCB-Ev, the first real-world event-vision dataset with ground-truth 6DoF poses for 21 YCB objects, enabling evaluation of pose estimation across RGB-D and event modalities and facilitating cross-dataset analysis with YCB-V. Ground-truth poses are generated from RGB data using a robust calibration and tracking pipeline and then transferred to the event camera frame, supplemented by tools for pose interpolation and event alignment. The dataset comprises 21 sequences (13,851 frames; 7m43s of event data), including 12 sequences that mirror the YCB-V/BOP subset, and includes challenging conditions such as rapid motion, occlusions, and frame drops. A bias analysis shows that domain gaps between RGB-based training and event-domain evaluation persist, highlighting the need for event-focused training and annotation, as well as potential synthetic data approaches. Overall, YCB-Ev enables direct evaluation of event-based 6DoF pose estimation and cross-dataset generalization, advancing research at the intersection of neuromorphic sensing and 3D object pose estimation.

Abstract

Our work introduces the YCB-Ev dataset, which contains synchronized RGB-D frames and event data and enables the evaluation of 6DoF object pose estimation algorithms on both modalities. The dataset provides ground truth 6DoF object poses for the same 21 YCB objects that were used in the YCB-Video (YCB-V) dataset, allowing for cross-dataset evaluation of algorithm performance. It consists of 21 synchronized event and RGB-D sequences, totalling 13,851 frames (7 minutes and 43 seconds of event data). Notably, 12 of these sequences feature the same object arrangement as the YCB-V subset used in the BOP challenge. Ground truth poses are generated by detecting objects in the RGB-D frames, interpolating the poses to align with the event timestamps, and then transferring them to the event coordinate frame using extrinsic calibration. Our dataset is the first to provide ground truth 6DoF pose data for event streams. Furthermore, we evaluate the generalization capabilities of two state-of-the-art algorithms, which were pre-trained for the BOP challenge, using our novel YCB-V sequences. The dataset is publicly available at https://github.com/paroj/ycbev.
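The last two steps of the ground-truth pipeline described in the abstract (interpolating poses to an event timestamp, then moving them into the event camera frame) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the (w, x, y, z) quaternion convention, the choice of spherical linear interpolation for rotation, and all numeric values, including the extrinsics, are assumptions made for illustration only.

```python
import math

def slerp(q0, q1, u):
    """Spherical linear interpolation between unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:  # flip one quaternion so we follow the shorter arc
        q1 = tuple(-c for c in q1)
        dot = -dot
    if dot > 0.9995:  # nearly parallel: linear blend avoids dividing by ~0
        q = tuple(a + u * (b - a) for a, b in zip(q0, q1))
    else:
        theta = math.acos(dot)
        s0 = math.sin((1.0 - u) * theta) / math.sin(theta)
        s1 = math.sin(u * theta) / math.sin(theta)
        q = tuple(s0 * a + s1 * b for a, b in zip(q0, q1))
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)

def quat_to_matrix(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def interpolate_pose(t, t0, pose0, t1, pose1):
    """Interpolate a (quaternion, translation) pose between two RGB-D frames."""
    u = (t - t0) / (t1 - t0)
    quat = slerp(pose0[0], pose1[0], u)
    trans = tuple(a + u * (b - a) for a, b in zip(pose0[1], pose1[1]))
    return quat, trans

def to_event_frame(pose, R_ext, t_ext):
    """Apply the RGB-to-event extrinsics: T_event_obj = T_event_rgb * T_rgb_obj."""
    R_obj = quat_to_matrix(pose[0])
    R = [[sum(R_ext[i][k] * R_obj[k][j] for k in range(3)) for j in range(3)]
         for i in range(3)]
    t = tuple(sum(R_ext[i][k] * pose[1][k] for k in range(3)) + t_ext[i]
              for i in range(3))
    return R, t

# Hypothetical object poses in the RGB camera frame at two frame timestamps.
pose_a = ((1.0, 0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
pose_b = ((math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4)), (0.2, 0.0, 1.0))

# Made-up extrinsics: cameras side by side, 5 cm apart, no relative rotation.
R_ext = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_ext = (-0.05, 0.0, 0.0)

# Pose at an event timestamp halfway between the two RGB-D frames.
pose_mid = interpolate_pose(0.5, 0.0, pose_a, 1.0, pose_b)
R_event, t_event = to_event_frame(pose_mid, R_ext, t_ext)
```

In this toy setup the interpolated pose is a 45 degree rotation about the z axis, and the translation is shifted by the assumed 5 cm baseline. A real pipeline would take both the timestamps and the extrinsics from the dataset's calibration rather than from hard-coded values.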

Paper Structure (13 sections, 5 figures, 4 tables)

Figures (5)

  • Figure 1: The Intel RealSense D435 RGB-D camera (on the left) and the Prophesee EVK2 event camera (on the right) mounted side by side, along with their respective annotated images. Throughout this paper, the polarity of the events is represented as blue (falling) and green (rising).
  • Figure 2: Our method for joint event and RGB camera calibration relies on a flashing blob pattern that can be detected by both sensor technologies.
  • Figure 3: The object arrangement in our dataset (left) corresponds to the object arrangement in the YCB-V dataset (right).
  • Figure 4: Our dataset contains challenging frames that exhibit fast camera motion and low-light conditions.
  • Figure 5: The 21 sequences in our dataset.