Pre-training a Density-Aware Pose Transformer for Robust LiDAR-based 3D Human Pose Estimation

Xiaoqi An, Lin Zhao, Chen Gong, Jun Li, Jian Yang

TL;DR

This work tackles robust LiDAR-only 3D human pose estimation from noisy and sparse point clouds. It introduces a density-aware pose transformer (DAPT) that uses learnable joint anchors and a multi-density exchange (MDE) module to extract stable joint representations, with 1D heatmaps encoding the keypoint locations, which reduces sensitivity to point density. A LiDAR human synthesis and augmentation pipeline, built on ray casting and laser-level masking, supplies a human body prior during pre-training that transfers to real data. The combined approach achieves state-of-the-art results on LidarHuman26M, SLOPER4D, Waymo, and HumanM3, reducing the average MPJPE by 10.0 mm against LPFormer on Waymo and by 20.7 mm against PRN on SLOPER4D. It also remains stable under occlusion, noise, and sparsity, which matters for outdoor and autonomous driving scenarios.
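The summary does not spell out how the laser-level masks are built. One plausible reading is that whole laser beams are dropped from a human point cloud to imitate occlusion and sparsity. The sketch below illustrates that reading only: the function names `laser_ids` and `mask_lasers`, the 32-beam sensor, the vertical field of view, and the drop ratio are all assumptions for illustration, not values from the paper.

```python
import math
import random


def laser_ids(points, num_lasers=32, fov_deg=(-25.0, 15.0)):
    """Assign each (x, y, z) point to a laser index by elevation angle.

    Assumes the sensor sits at the origin and that beams are spaced
    evenly over the vertical field of view (hypothetical sensor model).
    """
    low, high = fov_deg
    ids = []
    for x, y, z in points:
        elevation = math.degrees(math.atan2(z, math.hypot(x, y)))
        fraction = (elevation - low) / (high - low)
        # Clamp so points slightly outside the field of view stay valid.
        ids.append(min(num_lasers - 1, max(0, int(fraction * num_lasers))))
    return ids


def mask_lasers(points, drop_ratio=0.25, num_lasers=32,
                fov_deg=(-25.0, 15.0), seed=None):
    """Remove every point hit by a randomly chosen subset of lasers."""
    rng = random.Random(seed)
    ids = laser_ids(points, num_lasers, fov_deg)
    present = sorted(set(ids))  # lasers that actually hit the body
    dropped = set(rng.sample(present, int(len(present) * drop_ratio)))
    return [p for p, i in zip(points, ids) if i not in dropped]


if __name__ == "__main__":
    # A crude vertical "person" 10 m in front of the sensor.
    person = [(10.0, 0.0, -1.5 + 0.02 * i) for i in range(90)]
    sparse = mask_lasers(person, drop_ratio=0.5, seed=0)
    print(len(person), "->", len(sparse), "points after masking")
```

Dropping whole beams rather than random points keeps the stripe-like structure of a real scan, which is presumably why the masking is described at the laser level.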

Abstract

With the rapid development of autonomous driving, LiDAR-based 3D Human Pose Estimation (3D HPE) is becoming a research focus. However, due to the noise and sparsity of LiDAR-captured point clouds, robust human pose estimation remains challenging. Most existing methods use temporal information, multi-modal fusion, or SMPL optimization to correct biased results. In this work, we aim to obtain sufficient information for 3D HPE solely by modeling the intrinsic properties of low-quality point clouds. Hence, a simple yet powerful method is proposed, which provides insights into both the modeling and the augmentation of point clouds. First, we propose a concise and effective density-aware pose transformer (DAPT) to obtain stable keypoint representations. By using a set of joint anchors and a carefully designed exchange module, valid information is extracted from point clouds with different densities. Then 1D heatmaps are utilized to represent the precise locations of the keypoints. Second, a comprehensive LiDAR human synthesis and augmentation method is proposed to pre-train the model, enabling it to acquire a better human body prior. We increase the diversity of point clouds by randomly sampling human positions and orientations and by simulating occlusions through the addition of laser-level masks. Extensive experiments have been conducted on multiple datasets, including the IMU-annotated LidarHuman26M and SLOPER4D, and the manually annotated Waymo Open Dataset v2.0 (Waymo) and HumanM3. Our method demonstrates SOTA performance in all scenarios. In particular, compared with LPFormer on Waymo, we reduce the average MPJPE by $10.0$ mm. Compared with PRN on SLOPER4D, we reduce the average MPJPE by $20.7$ mm.
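For readers unfamiliar with the metric behind these numbers, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth joints. The following is a minimal sketch of that standard definition; the function name `mpjpe` and the optional `visible` mask are assumptions added for illustration, and the evaluation protocol of each dataset may differ in details such as which joints are counted.

```python
import math


def mpjpe(pred, gt, visible=None):
    """Mean per-joint position error, in the same unit as the inputs.

    pred, gt: sequences of (x, y, z) joint positions in matching order.
    visible:  optional per-joint flags; joints marked False are skipped
              (an illustrative convention, not a dataset requirement).
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must contain the same number of joints")
    if visible is None:
        visible = [True] * len(gt)
    errors = [math.dist(p, g) for p, g, v in zip(pred, gt, visible) if v]
    if not errors:
        raise ValueError("no visible joints to evaluate")
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Two joints, coordinates in metres: errors of 3 cm and 4 cm.
    pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    gt = [(0.0, 0.0, 0.03), (1.0, 0.04, 0.0)]
    print(round(mpjpe(pred, gt) * 1000, 3), "mm")
```

With coordinates in metres, multiplying the result by 1000 gives millimetres, the unit in which the reductions above are reported.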

Paper Structure

This paper contains 31 sections, 10 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: We propose a novel framework for LiDAR-based 3D HPE. With comprehensive LiDAR human synthesis & augmentation for model pre-training as well as the learning of stable joint representations, our method produces robust results from low-quality point clouds.
  • Figure 2: The difficulties of LiDAR-based 3D HPE mainly lie in samples with noisy, occluded, or sparse point clouds.
  • Figure 3: Results of the segment-regression-based methods. The color of each point indicates which joint it belongs to. Wrong segmentation leads to a biased joint location.
  • Figure 4: Overall structure of our method. It mainly consists of (a) a comprehensive LiDAR human synthesis and augmentation framework to provide internal human priors, and (b) a density-aware pose transformer (DAPT) that uses the multi-density exchange (MDE) module to extract stable joint representations from point cloud features.
  • Figure 5: Visualization results on common datasets with challenging samples. Our method demonstrates strong stability on low-quality point clouds including occlusion, noise, and sparsity.
  • ...and 3 more figures