A Simple and Effective Point-based Network for Event Camera 6-DOFs Pose Relocalization

Hongwei Ren, Jiadong Zhu, Yue Zhou, Haotian FU, Yulong Huang, Bojun Cheng

TL;DR

This work tackles 6-DOF pose relocalization for event cameras by introducing PEPNet, a lightweight point-based network that directly processes raw Event Cloud data represented as a 3D pseudo-point cloud with coordinates $(x,y,t)$. A hierarchical structure preserves spatial and implicit temporal information, while an Attentive Bi-LSTM explicitly captures temporal dependencies to regress pose with a simple $L_2$-style objective and weight regularization. PEPNet achieves state-of-the-art results on indoor IJRR and outdoor M3ED datasets with far fewer parameters than frame-based approaches, and a tiny variant with only $0.5\%$ of the parameters delivers comparable performance, highlighting strong potential for edge-enabled CPR on event cameras. The method emphasizes end-to-end processing, robustness across splits, and low-latency inference, making it practical for real-time, power-constrained scenarios.
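As a rough illustration of what an $L_2$-style pose objective with weight regularization can look like, here is a minimal sketch. The split of the 6-DOF vector into three translation and three rotation values, the balancing factors `alpha` and `lam`, and the function name `pose_loss` are all assumptions made for illustration; the summary above does not give the exact form used by PEPNet.

```python
import math


def pose_loss(pred, target, weights, alpha=1.0, lam=1e-4):
    """Illustrative L2-style pose loss with weight regularization.

    pred, target: sequences of 6 floats, assumed here to be a 3-value
        translation followed by a 3-value rotation encoding.
    weights: flat iterable of model parameters to regularize.
    alpha, lam: hypothetical balancing factors (not from the paper).
    """
    if len(pred) != 6 or len(target) != 6:
        raise ValueError("expected 6-DOF pose vectors")

    # Euclidean distance between predicted and true translation.
    trans_err = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred[:3], target[:3])))
    # Euclidean distance between predicted and true rotation encoding.
    rot_err = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred[3:], target[3:])))
    # Squared L2 norm of the weights (classic weight decay term).
    reg = sum(w * w for w in weights)

    return trans_err + alpha * rot_err + lam * reg


# A prediction that is off by (3, 4, 0) in translation only, no weights:
print(pose_loss([0, 0, 0, 0, 0, 0], [3, 4, 0, 0, 0, 0], []))  # 5.0
```

In practice the regularization term is often delegated to the optimizer's weight decay setting rather than written into the loss; it is spelled out here only to make the two components visible.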

Abstract

Event cameras exhibit remarkable attributes such as high dynamic range, asynchronicity, and low latency, making them highly suitable for vision tasks that involve high-speed motion in challenging lighting conditions. These cameras implicitly capture movement and depth information in events, making them appealing sensors for Camera Pose Relocalization (CPR) tasks. Nevertheless, existing event-based CPR networks neglect the pivotal fine-grained temporal information in events, resulting in unsatisfactory performance. Moreover, the energy efficiency of event cameras is further compromised by the use of excessively complex models, hindering efficient deployment on edge devices. In this paper, we introduce PEPNet, a simple and effective point-based network designed to regress six degrees of freedom (6-DOFs) event camera poses. We rethink the relationship between the event camera and CPR tasks, leveraging the raw Point Cloud directly as network input to harness the high temporal resolution and inherent sparsity of events. PEPNet abstracts spatial and implicit temporal features through a hierarchical structure, and explicit temporal features through an Attentive Bi-directional Long Short-Term Memory (A-Bi-LSTM). By employing a carefully crafted lightweight design, PEPNet delivers state-of-the-art (SOTA) performance on both indoor and outdoor datasets with meager computational resources. Specifically, PEPNet attains a significant 38% and 33% performance improvement on the random split IJRR and M3ED datasets, respectively. Moreover, the lightweight version PEPNet$_{tiny}$ achieves results comparable to the SOTA while employing a mere 0.5% of the parameters.

Paper Structure (26 sections, 14 equations, 6 figures, 5 tables, 1 algorithm)

Figures (6)

  • Figure 1: The average results using the random split method benchmarked on the CPR dataset (Mueggler et al., 2017). The vertical axis represents the combined rotational and translational errors (m+rad). PEPNet is the first point-based CPR network for event cameras.
  • Figure 2: Two different event-based processing methods, frame-based and point-based.
  • Figure 3: PEPNet overall architecture (the time resolution of $t_1, t_2, \ldots, t_n$ is $1\,\mu\text{s}$). The input Event Cloud is handled directly through a sliding window, sampling, and normalization, eliminating the need for any format conversion. The input then passes through $S_{num}$ hierarchy structures for spatial feature abstraction and extraction, followed by a bidirectional LSTM for temporal feature extraction, and finally a regressor responsible for 6-DOFs camera pose relocalization.
  • Figure 4: Event-based CPR Dataset visualization.
  • Figure 5: Error distribution of event-based CPR results achieved by PEPNet using a random split. (a) Translation errors. (b) Rotation errors.
  • ...and 1 more figure
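The preprocessing named in the Figure 3 caption (sliding window, sampling, normalization) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names, the count-based window, uniform random sampling, and the 240×180 sensor size in the usage example are all assumptions chosen to keep the example small.

```python
import random


def sliding_windows(events, window, stride):
    """Yield consecutive slices of a time-sorted list of (x, y, t) events.

    window and stride are event counts (an assumption; a time-based
    window would follow the same pattern).
    """
    for start in range(0, len(events) - window + 1, stride):
        yield events[start:start + window]


def sample_events(events, num_points, rng):
    """Randomly keep num_points events while preserving temporal order."""
    if len(events) <= num_points:
        return list(events)
    keep = sorted(rng.sample(range(len(events)), num_points))
    return [events[i] for i in keep]


def normalize_events(events, width, height):
    """Scale x and y by the sensor size and t to [0, 1] within the window.

    Expects a non-empty, time-sorted list of events.
    """
    t0 = events[0][2]
    span = (events[-1][2] - t0) or 1  # guard against a zero time span
    return [(x / width, y / height, (t - t0) / span) for x, y, t in events]


# Usage with synthetic events: one event every 2 microseconds.
rng = random.Random(0)
events = [(rng.randrange(240), rng.randrange(180), t) for t in range(0, 1000, 2)]
clouds = [
    normalize_events(sample_events(w, 64, rng), width=240, height=180)
    for w in sliding_windows(events, window=128, stride=64)
]
print(len(clouds), len(clouds[0]))  # 6 64
```

Each resulting cloud is a fixed-size list of $(x, y, t)$ points in the unit cube, which is the kind of input a point-based network can consume without converting events into frames.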