LiCamPose: Combining Multi-View LiDAR and RGB Cameras for Robust Single-frame 3D Human Pose Estimation

Zhiyu Pan; Zhicheng Zhong; Wenxuan Guo; Yifan Chen; Jianjiang Feng; Jie Zhou

TL;DR

LiCamPose targets robust single-frame 3D human pose estimation from multi-view LiDAR and RGB inputs by fusing sparse point clouds with RGB heatmaps in a unified volumetric space. It combines top-down person detection (PointPillars) with a voxel-based fusion network and Soft-argmax for 3D joint estimation. The model is pretrained on synthetic data from SyncHuman and then adapted with an unsupervised domain adaptation framework that uses entropy-based pseudo labels and a human-prior loss. The approach achieves competitive MPJPE and PA-MPJPE on Panoptic Studio and BasketBallSync, and it generalizes to a real basketball scenario, which supports cross-domain transfer across diverse scenes. By providing SyncHuman and an unsupervised training framework, LiCamPose reduces the need for manual 3D pose labels while offering practical multi-modal, multi-view 3D pose estimation.
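As a rough illustration of two ingredients named in the summary, Soft-argmax over a voxel volume and an entropy score that could be used to select pseudo labels, the following minimal sketch computes both for a single joint. It is not the authors' implementation: the function name `soft_argmax_3d` and the parameters `grid_min`, `voxel_size`, and `beta` are hypothetical, and how LiCamPose actually thresholds or weights the entropy is not stated on this page.

```python
import math
from itertools import product


def soft_argmax_3d(scores, grid_min, voxel_size, beta=1.0):
    """Expected 3D location and entropy for one joint.

    scores     : nested list [X][Y][Z] of raw voxel scores (assumed layout)
    grid_min   : (x, y, z) of the volume's minimum corner, in metres (assumed)
    voxel_size : edge length of a cubic voxel, in metres (assumed)
    beta       : softmax temperature (hypothetical parameter)
    """
    nx, ny, nz = len(scores), len(scores[0]), len(scores[0][0])
    indices = list(product(range(nx), range(ny), range(nz)))

    # Numerically stable softmax over every voxel in the volume.
    flat = [beta * scores[i][j][k] for i, j, k in indices]
    peak = max(flat)
    exps = [math.exp(s - peak) for s in flat]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Soft-argmax: probability-weighted mean of the voxel centres.
    joint = [0.0, 0.0, 0.0]
    for p, idx in zip(probs, indices):
        for axis in range(3):
            centre = grid_min[axis] + (idx[axis] + 0.5) * voxel_size
            joint[axis] += p * centre

    # Shannon entropy of the voxel distribution; lower means more peaked.
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return joint, entropy


if __name__ == "__main__":
    # Toy 3x3x3 volume with one dominant voxel at index (2, 0, 1).
    volume = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
    volume[2][0][1] = 50.0
    joint, entropy = soft_argmax_3d(volume, (0.0, 0.0, 0.0), 1.0)
    print("joint:", joint, "entropy:", entropy)
```

In a sketch like this, a sharply peaked volume yields low entropy and a flat volume yields entropy close to the logarithm of the voxel count, so a pseudo-label selector could plausibly keep only low-entropy predictions. The cut-off value would be a tuning choice that this page does not specify.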

Abstract

Several methods have been proposed to estimate 3D human pose from multi-view images, achieving satisfactory performance on public datasets collected under relatively simple conditions. However, few approaches study the extraction of 3D human skeletons from multimodal inputs, such as RGB and point cloud data. To address this gap, we introduce LiCamPose, a pipeline that integrates multi-view RGB and sparse point cloud information to estimate robust 3D human poses from a single frame. We demonstrate the effectiveness of the volumetric architecture in combining these modalities. Furthermore, to circumvent the need for manually labeled 3D human pose annotations, we develop a synthetic dataset generator for pretraining and design an unsupervised domain adaptation strategy to train a 3D human pose estimator without manual annotations. To validate the generalization capability of our method, LiCamPose is evaluated on four datasets covering diverse scenarios: two public datasets, one synthetic dataset, and one challenging self-collected dataset named BasketBall. The results demonstrate that LiCamPose generalizes well across these scenarios and has significant application potential. The code, generator, and datasets will be made available upon acceptance of this paper.

Paper Structure (26 sections, 13 equations, 10 figures, 6 tables)

Introduction
Related Works
3D Human Pose Estimation
Synthetic Dataset Generation
Unsupervised Domain Adaptation Training
Methodology
3D Human Pose Estimation
Unsupervised Domain Adaptation
SyncHuman Generator
Unsupervised Domain Adaptation
Experiments
Implementation details.
Datasets and Metrics.
3D Pose Estimation Analysis
Unsupervised Domain Adaptation
...and 11 more sections

Figures (10)

Figure 1: The LiCamPose pipeline for extracting 3D poses, as exemplified by the BasketBall dataset, involves pretraining on synthetic data from SyncHuman, followed by detecting and tracking individuals, and finally using unsupervised domain adaptation to estimate poses.
Figure 2: The detailed structure of LiCamPose for 3D human pose estimation and the corresponding loss calculations.
Figure 3: Three examples of 3D human pose estimation on MVOR. Blue lines represent predictions, green lines represent ground truth. The first three columns show 2D projections from different views, and the fourth column shows the 3D pose results.
Figure 4: Qualitative illustration on the BasketBall dataset from different input modalities. The first row shows 2D pose estimations (missing where not estimable) and point clouds. The second row displays results from using only RGB input, with 2D poses projected from the estimated 3D poses. The third row presents results from using both RGB and point cloud inputs.
Figure 5: Qualitative comparison of different unsupervised training losses on BasketBall. "Baseline" uses only pseudo 2D pose supervision. "Entropy" adds entropy-selected pseudo 3D pose supervision. "Prior" adds the human prior loss.
...and 5 more figures
