3D Landmark Detection on Human Point Clouds: A Benchmark and A Dual Cascade Point Transformer Framework

Fan Zhang, Shuyi Mao, Qing Li, Xiaojiang Peng

TL;DR

This work tackles 3D landmark detection on unordered human point clouds by introducing HPoint103, a real-world dataset with 11 landmarks across 103 scans, and a Dual Cascade Point Transformer (D-CPT) that localizes landmarks directly on point clouds. D-CPT uses an encoder–decoder architecture with multiple cascade Transformer decoders operating over the full point cloud, coupled with a RefineNet that searches local neighborhoods around coarse predictions to refine landmark coordinates. The approach achieves state-of-the-art results on HPoint103 and competitive performance on DHP19, and its RefineNet module can improve other point-based methods as a plug-in. This work advances 3D human landmark detection, enabling more robust downstream applications in 3D pose estimation, head swapping, and virtual try-on in real-world, textured point-cloud data.

Abstract

3D landmark detection plays a pivotal role in applications such as 3D registration, pose estimation, and virtual try-on. While 2D human landmark detection and pose estimation have seen considerable success, few reported works address landmark detection in unordered 3D point clouds. This paper introduces a new challenge, 3D landmark detection on human point clouds, and makes two primary contributions. First, we establish a human point cloud dataset, named HPoint103, designed to support the 3D landmark detection community. The dataset comprises 103 human point clouds created with commercial software and actors, each manually annotated with 11 stable landmarks. Second, we propose a Dual Cascade Point Transformer (D-CPT) model for precise point-based landmark detection. D-CPT gradually refines the landmarks through cascade Transformer decoder layers operating on the entire point cloud, while a RefineNet further refines the landmark coordinates over local regions. Comparative evaluations against popular point-based methods on HPoint103 and the public dataset DHP19 show that D-CPT substantially outperforms them. Additionally, integrating our RefineNet into existing methods consistently improves their performance.
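To make the idea of stage-by-stage refinement over the whole point cloud more concrete, here is a minimal toy sketch. It is not the D-CPT model: the learned Transformer decoder layers are replaced by a hand-written, mean-shift-style soft attention update, and the names `soft_attend`, `cascade_refine`, and the `temperatures` schedule are hypothetical choices made only for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def soft_attend(cloud: Sequence[Point], query: Point, temperature: float) -> Point:
    """Softmax-weighted average of ALL points, weighted by closeness to `query`.

    Assumption: this stands in for one learned decoder layer attending
    over the full point cloud. Nothing here is trained.
    """
    scores = [
        -sum((p[a] - query[a]) ** 2 for a in range(3)) / temperature
        for p in cloud
    ]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return tuple(
        sum(w * p[a] for w, p in zip(weights, cloud)) / total for a in range(3)
    )


def cascade_refine(
    cloud: Sequence[Point],
    initial: Sequence[Point],
    temperatures: Sequence[float],
) -> List[List[Point]]:
    """Run one stage per temperature; every stage updates every landmark.

    Returns the estimates after each stage, starting with the initial guess.
    """
    estimates = list(initial)
    history = [estimates]
    for t in temperatures:
        estimates = [soft_attend(cloud, q, t) for q in estimates]
        history.append(estimates)
    return history
```

Calling `cascade_refine(cloud, initial, [4.0, 1.0])` returns three lists of estimates: the initial guess and the output of each of the two stages. A decreasing temperature makes later stages attend to a narrower part of the cloud, which is one simple way to mimic coarse-to-fine behavior.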

Paper Structure (25 sections, 5 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Illustration of our dual cascade framework. The horizontal cascade process gradually refines results over the whole human point cloud model, while the vertical cascade process refines each landmark over local regions.
  • Figure 2: The building steps of our HPoint103, which include video recording, human matting, point cloud generation, and point annotation.
  • Figure 3: Qualitative comparison between different human landmark datasets. For DHP19, the left part shows the point cloud converted from frames, whose quality is far worse than ours.
  • Figure 4: The pipeline of our proposed network. The input point cloud is first encoded into the point-wise feature. The hierarchical point-wise feature after the cascade decoding process is then transformed into the coarse prediction and sent to the RefineNet. In the refinement process, each landmark is upsampled to $k$ points through kNN search in the region of interest. The input coarse prediction is then refined into the fine prediction.
  • Figure 5: Visualization of predicted locations before and after RefineNet.
  • ...and 1 more figures
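
The Figure 4 caption describes expanding each coarse landmark to $k$ points through a kNN search in its region of interest. The sketch below illustrates only that gathering step with a brute-force search in plain Python. The function names are hypothetical, and `centroid_refine` is a placeholder that simply averages the neighborhood; it is an assumption standing in for the learned RefineNet, whose internals are not described here.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def knn_indices(cloud: Sequence[Point], query: Point, k: int) -> List[int]:
    """Return the indices of the k points in `cloud` closest to `query`."""
    if not 1 <= k <= len(cloud):
        raise ValueError("k must be between 1 and the number of points")
    # Brute-force search: sort all point indices by Euclidean distance.
    order = sorted(range(len(cloud)), key=lambda i: math.dist(cloud[i], query))
    return order[:k]


def gather_regions(
    cloud: Sequence[Point], coarse: Sequence[Point], k: int
) -> List[List[Point]]:
    """For each coarse landmark, collect its k nearest points in the cloud."""
    return [[cloud[i] for i in knn_indices(cloud, lm, k)] for lm in coarse]


def centroid_refine(region: Sequence[Point]) -> Point:
    """Placeholder refinement (assumption): average of the local region."""
    n = len(region)
    return tuple(sum(p[a] for p in region) / n for a in range(3))


if __name__ == "__main__":
    cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
             (5.0, 5.0, 5.0), (5.0, 6.0, 5.0)]
    coarse = [(0.2, 0.1, 0.0), (5.0, 5.4, 5.0)]
    for region in gather_regions(cloud, coarse, k=2):
        print(region, "->", centroid_refine(region))
```

For real point clouds with many points, a spatial index such as a k-d tree would be the usual replacement for the brute-force sort; it is kept simple here so the example has no dependencies.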