Exploiting Spatiotemporal Properties for Efficient Event-Driven Human Pose Estimation

Haoxian Zhou, Chuanzhi Xu, Langyi Chen, Haodong Chen, Yuk Ying Chung, Qiang Qu, Xiaoming Chen, Weidong Cai

TL;DR

Event cameras offer high temporal resolution, but existing methods often convert event streams to dense frames, losing sparsity and temporal detail. The authors present a sparse, point-cloud-based pipeline that uses rasterized event representations with Sobel edge enhancement, plus two temporal modules (Event Temporal Slicing Convolution and Event Slice Sequencing) to capture short-term dynamics. These components are integrated with three backbones (PointNet, DGCNN, Point Transformer) and evaluated on DHP19, where they consistently improve MPJPE in both 2D and 3D while running in real time. The approach reduces computation relative to frame-based methods and makes pose estimation more robust under sparse, challenging conditions.
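For readers unfamiliar with the metric, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth joints, in pixels for 2D and in metric units for 3D. The following is a minimal sketch of that definition only; the function name `mpjpe` and the toy coordinates are illustrative assumptions, not the authors' evaluation code.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints.

    pred, gt: equal-length sequences of joint coordinates (2D or 3D tuples).
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    total = 0.0
    for p, g in zip(pred, gt):
        total += math.dist(p, g)  # Euclidean distance for one joint
    return total / len(pred)


# Toy example (made-up coordinates): one joint is off by 5 px, one is exact.
pred_2d = [(10.0, 20.0), (33.0, 44.0)]
gt_2d = [(13.0, 24.0), (33.0, 44.0)]
print(mpjpe(pred_2d, gt_2d))  # 2.5
```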

Abstract

Human pose estimation focuses on predicting body keypoints to analyze human motion. Event cameras provide high temporal resolution and low latency, enabling robust estimation under challenging conditions. However, most existing methods convert event streams into dense event frames, which adds extra computation and sacrifices the high temporal resolution of the event signal. In this work, we exploit the spatiotemporal properties of event streams within a point cloud-based framework designed to enhance human pose estimation performance. We design an Event Temporal Slicing Convolution module to capture short-term dependencies across event slices, and combine it with an Event Slice Sequencing module for structured temporal modeling. We also apply edge enhancement to the point cloud-based event representation to strengthen spatial edge information under sparse event conditions and further improve performance. Experiments on the DHP19 dataset show that our proposed method consistently improves performance across three representative point cloud backbones: PointNet, DGCNN, and Point Transformer.
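The abstract does not spell out how the edge enhancement is computed; the Figure 1 caption describes it as Sobel-based and applied to rasterized events. As a rough illustration of that general idea, the sketch below accumulates events into a per-pixel count grid and takes the 3x3 Sobel gradient magnitude. The helper names `rasterize` and `sobel_magnitude`, the (x, y, t, polarity) event layout, zero padding at the border, and the toy 5x5 sensor are all assumptions made for this example; how the paper feeds the edge response back into the point cloud is not stated here.

```python
import math

# Standard 3x3 Sobel kernels (horizontal and vertical gradients).
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def rasterize(events, width, height):
    """Accumulate (x, y, t, polarity) events into a per-pixel count grid."""
    grid = [[0.0] * width for _ in range(height)]
    for x, y, _t, _p in events:
        if 0 <= x < width and 0 <= y < height:
            grid[y][x] += 1.0
    return grid


def sobel_magnitude(grid):
    """Sobel gradient magnitude of a 2D grid, treating out-of-bounds as zero.

    Kernels are applied in correlation form; the sign difference from true
    convolution does not affect the magnitude.
    """
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = gy = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        v = grid[yy][xx]
                        gx += SOBEL_X[dy + 1][dx + 1] * v
                        gy += SOBEL_Y[dy + 1][dx + 1] * v
            out[y][x] = math.hypot(gx, gy)
    return out


# Toy input: a vertical line of events at x = 2 on a 5x5 sensor.
events = [(2, y, 0.001 * y, 1) for y in range(5)]
grid = rasterize(events, width=5, height=5)
edges = sobel_magnitude(grid)
print(edges[2])  # [0.0, 4.0, 0.0, 4.0, 0.0]
```

In this toy case the response peaks on either side of the line and is zero on the line itself, which is the usual behavior of a gradient operator on a thin structure.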

Paper Structure

This paper contains 11 sections, 14 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Overview of the proposed pipeline. Event point clouds are rasterized with Sobel-based spatial edge enhancement and fed into the backbone. Temporal modeling with ES-Seq and ETSC refines the features, and SimDR [SimDR] decodes 2D poses from each view, which are triangulated into the final 3D pose.
  • Figure 2: Structure of the Event Slice Sequencing (ES-Seq) module.
  • Figure 3: Structure of the Event Temporal Slicing Convolution (ETSC) module.
  • Figure 4: Visualization of results from different models on the DHP19 dataset. (a–b) 2D results from cam3 view (baseline vs. ours). (c–d) 3D results via triangulation from cam2 & cam3 views (baseline vs. ours).
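The captions for Figures 1 and 4 mention that per-view 2D poses are triangulated into a 3D pose. The summary does not say which triangulation method is used, so the following is only a generic sketch of linear least-squares triangulation (the inhomogeneous form, which assumes the point is not at infinity). The functions `solve3`, `triangulate`, and `project`, and the two toy projection matrices, are hypothetical; they are not DHP19 calibration data or the authors' code.

```python
def solve3(m, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    n = 3
    a = [row[:] + [rhs] for row, rhs in zip(m, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("degenerate camera geometry")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= f * a[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = a[i][n] - sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / a[i][i]
    return tuple(x)


def triangulate(points_2d, projections):
    """Least-squares 3D position of one joint from two or more views.

    points_2d: list of (u, v) image coordinates, one per view.
    projections: list of 3x4 projection matrices (nested lists), one per view.
    """
    rows, rhs = [], []
    for (u, v), P in zip(points_2d, projections):
        for coord, r in ((u, 0), (v, 1)):
            # From coord = (P[r] . Xh) / (P[2] . Xh) with Xh = (X, Y, Z, 1).
            rows.append([coord * P[2][k] - P[r][k] for k in range(3)])
            rhs.append(P[r][3] - coord * P[2][3])
    # Normal equations: (A^T A) x = A^T b
    ata = [[sum(row[i] * row[j] for row in rows) for j in range(3)]
           for i in range(3)]
    atb = [sum(row[i] * b for row, b in zip(rows, rhs)) for i in range(3)]
    return solve3(ata, atb)


def project(P, point):
    """Project a 3D point with a 3x4 matrix and return (u, v)."""
    xh = list(point) + [1.0]
    x, y, w = (sum(P[r][k] * xh[k] for k in range(4)) for r in range(3))
    return (x / w, y / w)


# Two toy cameras: one at the origin, one shifted by 1 unit along X.
P_a = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
P_b = [[1.0, 0.0, 0.0, -1.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
joint = (0.5, 0.2, 2.0)
views = [project(P_a, joint), project(P_b, joint)]
recovered = triangulate(views, [P_a, P_b])
print(recovered)
```

With noise-free 2D inputs the recovered point matches the original joint up to floating-point rounding; with noisy per-view predictions the same routine returns the least-squares compromise between the views.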