Exploiting Spatiotemporal Properties for Efficient Event-Driven Human Pose Estimation
Haoxian Zhou, Chuanzhi Xu, Langyi Chen, Haodong Chen, Yuk Ying Chung, Qiang Qu, Xiaoming Chen, Weidong Cai
TL;DR
Event cameras offer high temporal resolution, but existing methods often convert event streams to dense frames, losing sparsity and temporal detail. The authors present a sparse, point-cloud-based pipeline that uses rasterized event representations with Sobel edge enhancement, plus two temporal modules (Event Temporal Slicing Convolution and Event Slice Sequencing) to capture short-term dynamics. These components are integrated with multiple backbones (PointNet, DGCNN, Point Transformer) and evaluated on DHP19, showing consistently lower MPJPE in both 2D and 3D along with real-time performance. The approach reduces computation relative to frame-based methods and makes pose estimation more robust under sparse, challenging conditions.
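For readers unfamiliar with the metric, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth joints, so lower is better. The sketch below is a minimal illustration of that definition, not the authors' evaluation code; the function name `mpjpe` and the toy joint coordinates are assumptions made for the example.

```python
import math

def mpjpe(pred, target):
    """Mean per-joint position error.

    pred, target: sequences of joints, each joint a sequence of
    2 (for 2D) or 3 (for 3D) coordinates in the same units.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    # Average the per-joint Euclidean distances.
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)

# Toy example with two 3D joints (hypothetical values).
pred = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
gt = [(3.0, 4.0, 0.0), (1.0, 1.0, 1.0)]
print(mpjpe(pred, gt))  # 2.5: the joint errors are 5.0 and 0.0
```

The result is expressed in whatever units the joint coordinates use.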
Abstract
Human pose estimation focuses on predicting body keypoints to analyze human motion. Event cameras provide high temporal resolution and low latency, enabling robust estimation under challenging conditions. However, most existing methods convert event streams into dense event frames, which adds extra computation and sacrifices the high temporal resolution of the event signal. In this work, we exploit the spatiotemporal properties of event streams with a point cloud-based framework designed to enhance human pose estimation performance. We design an Event Temporal Slicing Convolution module to capture short-term dependencies across event slices, and combine it with an Event Slice Sequencing module for structured temporal modeling. We also apply edge enhancement to the point cloud-based event representation to strengthen spatial edge information under sparse event conditions, which further improves performance. Experiments on the DHP19 dataset show that our proposed method consistently improves performance across three representative point cloud backbones: PointNet, DGCNN, and Point Transformer.
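To make the idea of event slices and an edge-enhanced point representation more concrete, here is a rough sketch under stated assumptions. It is not the authors' implementation: the abstract does not specify how slices are formed, how events are merged into points, or how the edge operator is applied. The sketch assumes events are `(x, y, t, polarity)` tuples, splits them into equal-duration slices, merges events that share a pixel into one point `(x, y, mean_t, count)`, and applies a 3x3 Sobel operator to the per-slice event-count image. All function names here are hypothetical.

```python
from collections import defaultdict

SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))

def slice_events(events, num_slices):
    """Split a non-empty list of (x, y, t, p) events into equal-duration bins."""
    t0 = min(e[2] for e in events)
    t1 = max(e[2] for e in events)
    span = (t1 - t0) or 1  # avoid division by zero if all timestamps match
    slices = [[] for _ in range(num_slices)]
    for x, y, t, p in events:
        # Clamp so the latest event falls into the last slice.
        k = min(int((t - t0) / span * num_slices), num_slices - 1)
        slices[k].append((x, y, t, p))
    return slices

def rasterize(slice_):
    """Merge events sharing a pixel into one point (x, y, mean_t, count)."""
    acc = defaultdict(lambda: [0.0, 0])  # pixel -> [sum of timestamps, count]
    for x, y, t, _p in slice_:
        acc[(x, y)][0] += t
        acc[(x, y)][1] += 1
    return [(x, y, ts / n, n) for (x, y), (ts, n) in sorted(acc.items())]

def sobel_edge_strength(points, width, height):
    """Return |gx| + |gy| of the event-count image at each point's pixel."""
    img = [[0] * width for _ in range(height)]
    for x, y, _t, n in points:
        img[y][x] = n

    def pixel(xx, yy):
        # Zero padding outside the sensor area.
        return img[yy][xx] if 0 <= xx < width and 0 <= yy < height else 0

    strengths = []
    for x, y, _t, _n in points:
        gx = sum(SOBEL_X[j][i] * pixel(x + i - 1, y + j - 1)
                 for j in range(3) for i in range(3))
        gy = sum(SOBEL_Y[j][i] * pixel(x + i - 1, y + j - 1)
                 for j in range(3) for i in range(3))
        strengths.append(abs(gx) + abs(gy))
    return strengths

# Toy stream on a hypothetical 4x4 sensor, split into two slices.
events = [(1, 1, 0, 1), (1, 1, 10, 1), (2, 1, 20, 0), (3, 3, 90, 1), (0, 0, 100, 1)]
slices = slice_events(events, 2)
points = rasterize(slices[0])
edges = sobel_edge_strength(points, 4, 4)
for point, edge in zip(points, edges):
    print(point, edge)
```

In a pipeline of this kind, the per-point edge strength could be appended as an extra feature channel before the points are passed to a backbone such as PointNet; whether the paper does exactly this is not stated in the abstract.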
