
SPiKE: 3D Human Pose from Point Cloud Sequences

Irene Ballester, Ondřej Peterka, Martin Kampel

TL;DR

SPiKE estimates 3D human pose from point cloud sequences. A Transformer encodes spatio-temporal relationships between points across the sequence, while partitioning each point cloud into local volumes and extracting their features with point spatial convolution keeps the Transformer input small and preserves spatial integrity per timestamp.

Abstract

3D Human Pose Estimation (HPE) is the task of locating keypoints of the human body in 3D space from 2D or 3D representations such as RGB images, depth maps or point clouds. Current HPE methods from depth and point clouds predominantly rely on single-frame estimation and do not exploit temporal information from sequences. This paper presents SPiKE, a novel approach to 3D HPE using point cloud sequences. Unlike existing methods that process frames of a sequence independently, SPiKE leverages temporal context by adopting a Transformer architecture to encode spatio-temporal relationships between points across the sequence. By partitioning the point cloud into local volumes and using spatial feature extraction via point spatial convolution, SPiKE ensures efficient processing by the Transformer while preserving spatial integrity per timestamp. Experiments on the ITOP benchmark for 3D HPE show that SPiKE reaches 89.19% mAP, achieving state-of-the-art performance with significantly lower inference times. Extensive ablations further validate the effectiveness of sequence exploitation and our algorithmic choices. Code and models are available at: https://github.com/iballester/SPiKE
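The abstract reports results as mAP without defining it. As a rough illustration, the sketch below computes a per-joint detection rate and its mean over joints. It assumes the convention commonly used for ITOP, where a joint counts as correct if it lies within 10 cm of the ground truth; that threshold, the function name `map_at_threshold`, and the toy arrays are assumptions of this example and are not taken from the text above.

```python
import numpy as np

def map_at_threshold(pred, gt, threshold=0.10):
    """Per-joint detection rate and its mean over joints.

    pred, gt: arrays of shape (num_frames, num_joints, 3), in metres.
    threshold: assumed 10 cm rule (not stated in the source text).
    """
    dist = np.linalg.norm(pred - gt, axis=-1)      # (frames, joints)
    per_joint = (dist < threshold).mean(axis=0)    # fraction of frames within threshold
    return per_joint, float(per_joint.mean())

# Toy data: 4 frames, 2 joints, ground truth at the origin.
gt = np.zeros((4, 2, 3))
pred = gt.copy()
pred[0, 0, 0] = 0.05   # 5 cm error: still counted as correct
pred[1, 1, 2] = 0.25   # 25 cm error: counted as a miss
per_joint, mean_ap = map_at_threshold(pred, gt)
print(per_joint, mean_ap)
```

Here joint 0 is correct in all four frames and joint 1 in three of four, so the mean over joints is 0.875.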

Paper Structure (26 sections, 4 equations, 4 figures, 2 tables)


Figures (4)

  • Figure 1: Importance of exploiting sequence information. When only the current frame is considered (sequence length $T = 1$), only the right hand is visible in the input point cloud, leading to an incorrect prediction. In contrast, when past frames are also considered ($T = 3$), in particular $t_2$ where both hands are visible, SPiKE estimates the position of both arms more accurately. Timestamp ID: 3_02244.
  • Figure 2: SPiKE pipeline. First, each point cloud of the sequence (sequence length = $T$) is partitioned by selecting $N_v$ reference points $P'_i$ and creating local volumes $V_{N_v}$ around them by sampling points within a radius $r$. Point Spatial Convolution extracts spatial features $F_i$ from each local volume. These features are then embedded with the coordinates of their respective reference point $P'_i$ and fed into the Transformer. After a max-pooling layer, an MLP regresses the 3D coordinates of the $M$ joints.
  • Figure 3: Qualitative results. Each pair shows the ground-truth skeleton on the left (keypoints in red) and the joints predicted by the model on the right (keypoints in blue). IDs, top row: A: 0_01439, B: 2_00220, C: 1_00587, D: 3_02966. IDs, bottom row: E: 0_01712, F: 2_02827, G: 0_00168, H: 1_01611.
  • Figure 4: Ablations. Left: Effect on performance (mAP) and running memory (GB) vs. sequence length $T$, using only past timestamps or both past and future timestamps. Spatial convolutions are employed for this ablation study. Right: Effect on performance (mAP) vs. sequence length $T$ for spatial convolutions and for spatio-temporal (ST) convolutions with temporal kernel size $k_t = 3$ and $k_t = 5$. For this ablation, only past timestamps are considered.
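
The first stage of the pipeline in Figure 2 (reference points, local volumes of radius $r$, one feature vector per volume, features paired with the reference-point coordinates) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: farthest point sampling is an assumed way of choosing the $N_v$ reference points, the hand-crafted max over absolute offsets is only a placeholder for the learned point spatial convolution, and all function names and parameter values are hypothetical.

```python
import numpy as np

def farthest_point_sampling(points, n_ref):
    """Pick n_ref well-spread point indices (assumed sampling strategy)."""
    chosen = [0]                                   # start from the first point
    dist = np.full(points.shape[0], np.inf)
    for _ in range(n_ref - 1):
        to_last = np.linalg.norm(points - points[chosen[-1]], axis=1)
        dist = np.minimum(dist, to_last)           # distance to nearest chosen point
        chosen.append(int(np.argmax(dist)))
    return np.array(chosen)

def local_volumes(points, n_ref, radius, n_samples, rng):
    """Return reference points and n_samples offsets sampled within radius of each."""
    refs = points[farthest_point_sampling(points, n_ref)]
    volumes = np.empty((n_ref, n_samples, 3))
    for i, ref in enumerate(refs):
        inside = np.flatnonzero(np.linalg.norm(points - ref, axis=1) <= radius)
        # The reference point itself is always inside, so `inside` is never empty.
        pick = rng.choice(inside, size=n_samples, replace=inside.size < n_samples)
        volumes[i] = points[pick] - ref            # coordinates relative to the reference
    return refs, volumes

def sequence_tokens(sequence, n_ref, radius, n_samples, seed=0):
    """Build one token per local volume for every point cloud in the sequence."""
    rng = np.random.default_rng(seed)
    tokens = []
    for cloud in sequence:                         # each timestamp is partitioned on its own
        refs, volumes = local_volumes(cloud, n_ref, radius, n_samples, rng)
        feats = np.abs(volumes).max(axis=1)        # placeholder for learned features F_i
        tokens.append(np.concatenate([refs, feats], axis=1))
    return np.concatenate(tokens, axis=0)          # (T * n_ref, 6), in frame order

# Synthetic sequence of T = 3 clouds with 256 points each.
data_rng = np.random.default_rng(42)
sequence = [data_rng.normal(size=(256, 3)) for _ in range(3)]
tokens = sequence_tokens(sequence, n_ref=8, radius=0.5, n_samples=16)
print(tokens.shape)  # (24, 6)
```

The Transformer then sees $T \cdot N_v$ tokens rather than every raw point, which is one way to read the efficiency claim in the abstract. In the pipeline described in Figure 2, these tokens would pass through the Transformer, a max-pooling layer, and an MLP that regresses the $M$ joint coordinates; those learned stages are omitted here.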