A Unified Framework for Human-centric Point Cloud Video Understanding

Yiteng Xu, Kecheng Ye, Xiao Han, Yiming Ren, Xinge Zhu, Yuexin Ma

TL;DR

This work addresses the limited generalization of existing human-centric PVU methods by proposing UniPVU-Human, a unified framework that exploits human priors at the global, part, and point levels together with a self-supervised, semantic-guided spatio-temporal representation learning pipeline. It introduces two pre-trained prior-extraction models (HBSeg for body-part segmentation and HMFlow for point-wise motion flow) and a self-learning stage that masks body-part patches to learn geometry and dynamics without annotations. A hierarchical fine-tuning stage then fuses global, part, and motion-aware features for downstream tasks. Experiments on HuCenLife (action recognition) and LIP (3D pose estimation) show state-of-the-art performance, with ablations validating the contribution of each module and showing robustness in semi-supervised settings. The authors also provide two synthetic datasets (LiDARFlow-Human and LiDARPart-Human) to support future research, and the results indicate that human-centric priors improve transferability across PVU tasks.

Abstract

Human-centric Point Cloud Video Understanding (PVU) is an emerging field focused on extracting and interpreting human-related features from sequences of human point clouds, with the goal of advancing downstream human-centric tasks and applications. Previous works usually tackle one specific task and rely on large amounts of labeled data, which limits their generalization capability. Considering that humans have specific characteristics, including the structural semantics of the human body and the dynamics of human motion, we propose a unified framework that makes full use of this prior knowledge and explores the inherent features of the data itself for generalized human-centric point cloud video understanding. Extensive experiments demonstrate that our method achieves state-of-the-art performance on various human-related tasks, including action recognition and 3D pose estimation. All datasets and code will be released soon.

Paper Structure (34 sections, 8 equations, 7 figures, 8 tables)

Figures (7)

  • Figure 1: UniPVU-Human extracts human-related prior knowledge at the global, part, and point levels to facilitate subsequent geometric and dynamic representation learning, and ultimately serves a range of downstream human-centric tasks such as action recognition and 3D pose estimation.
  • Figure 2: The main pipeline of UniPVU-Human, which is divided into three stages: (a) Prior Knowledge Extraction, (b) Semantic-Guided Spatio-temporal Representation Self-learning, and (c) Hierarchical Feature Enhanced Fine-tuning. First, the pre-trained HBSeg and HMFlow provide geometric and dynamic information, namely body-part segmentation results and point-wise motion flow. Then, the self-learning stage uses a body-part-based mask prediction mechanism to learn geometric and dynamic representations of humans without annotations. Finally, the fine-tuning stage integrates global-level, part-level, and point-level features to improve knowledge transfer to downstream tasks.
  • Figure 3: Visualization results of HBSeg on HuCenLife [xu2023human]. We show cases with different LiDAR point cloud densities, occlusion (yellow circle), and noise (black circle). HBSeg performs robustly even though it is trained only on our synthesized dataset.
  • Figure 4: Visualization results of HMFlow on HuCenLife [xu2023human]. We present several cases ordered from near to far relative to the LiDAR sensor. HMFlow estimates point flow well even for body parts with significant movement (yellow circle), providing explicit features of human dynamics.
  • Figure 5: The pipeline for generating the flow from the previous point cloud to the next point cloud. We associate each synthetic LiDAR point with its nearest SMPL vertex and use the SMPL vertex indices as an intermediary to establish correspondence between synthetic LiDAR points across frames, which yields point-wise motion flow.
  • ...and 2 more figures
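
The vertex-mediated correspondence described in the Figure 5 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the brute-force nearest-neighbor search, and the choice to approximate each point's flow by the displacement of its associated vertex are all assumptions made for illustration. The caption only states that SMPL vertex indices link synthetic LiDAR points across frames. The sketch relies on the fact that an SMPL mesh keeps the same vertex ordering in every frame, so index i refers to the same surface location at time t and t+1.

```python
import math


def nearest_vertex_index(point, vertices):
    """Return the index of the vertex closest to `point` (brute-force search)."""
    return min(range(len(vertices)), key=lambda i: math.dist(point, vertices[i]))


def vertex_mediated_flow(points_prev, verts_prev, verts_next):
    """Approximate per-point flow from frame t to frame t+1.

    points_prev: LiDAR points in frame t, as (x, y, z) tuples.
    verts_prev:  mesh vertices in frame t.
    verts_next:  the same mesh vertices (same ordering) in frame t+1.

    Assumption: a point moves like its nearest vertex, so its flow is the
    displacement of that vertex between the two frames.
    """
    flows = []
    for p in points_prev:
        i = nearest_vertex_index(p, verts_prev)  # associate point with a vertex
        flows.append(tuple(b - a for a, b in zip(verts_prev[i], verts_next[i])))
    return flows


# Toy example: two vertices, two points, each point close to one vertex.
verts_prev = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
verts_next = [(0.0, 0.0, 1.0), (1.0, 2.0, 0.0)]
points_prev = [(0.1, 0.0, 0.0), (0.9, 0.1, 0.0)]
print(vertex_mediated_flow(points_prev, verts_prev, verts_next))
```

For a full SMPL mesh (6,890 vertices) and dense point clouds, the brute-force search would normally be replaced by a spatial index such as a k-d tree. The linear scan is kept here only so the sketch has no dependencies.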