SkelSplat: Robust Multi-view 3D Human Pose Estimation with Differentiable Gaussian Rendering

Laura Bragagnolo; Leonardo Barcellona; Stefano Ghidoni

TL;DR

SkelSplat reframes multi-view 3D human pose estimation as differentiable Gaussian rendering: a skeleton of 3D Gaussians, one per joint, is supervised by pseudo 2D heatmaps built from multi-view detections. A one-hot joint encoding enables independent joint optimization within the Gaussian Splatting framework, while cross-view gradient accumulation and a 3D symmetry regularizer improve stability and anatomical coherence without requiring 3D ground-truth data. The method demonstrates state-of-the-art occlusion robustness and strong cross-dataset generalization across the Human3.6M, CMU Panoptic Studio, and Occlusion-Person datasets, highlighting its potential for deployment in new settings without retraining. However, it sacrifices real-time speed in favor of accuracy and scalability, motivating future work on faster rendering schemes and a multi-person extension that preserve its generalization advantages.
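To make the idea of a 3D symmetry regularizer concrete, here is a minimal NumPy sketch that penalizes length differences between paired left and right bones. This is one plausible form of such a term, not necessarily the one SkelSplat uses; the joint indices, the bone pairs in `SYMMETRIC_BONES`, and the function names are all hypothetical.

```python
import numpy as np

# Hypothetical left/right bone pairs, each bone given as (parent, child)
# joint indices. The indices are illustrative and do not follow any
# specific dataset's joint layout.
SYMMETRIC_BONES = [
    ((0, 1), (0, 2)),  # e.g. left vs right upper limb segment
    ((1, 3), (2, 4)),  # e.g. left vs right lower limb segment
]


def bone_length(joints, bone):
    """Euclidean length of one bone in a (J, 3) array of joint positions."""
    parent, child = bone
    return np.linalg.norm(joints[child] - joints[parent])


def symmetry_penalty(joints, pairs=SYMMETRIC_BONES):
    """Mean squared difference between paired left and right bone lengths."""
    diffs = [bone_length(joints, left) - bone_length(joints, right)
             for left, right in pairs]
    return float(np.mean(np.square(diffs)))
```

A term like this is zero for a perfectly symmetric skeleton and grows as paired limbs diverge in length, so it can be added to a rendering loss without any 3D ground-truth annotation.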

Abstract

Accurate 3D human pose estimation is fundamental for applications such as augmented reality and human-robot interaction. State-of-the-art multi-view methods learn to fuse predictions across views by training on large annotated datasets, leading to poor generalization when the test scenario differs. To overcome these limitations, we propose SkelSplat, a novel framework for multi-view 3D human pose estimation based on differentiable Gaussian rendering. Human pose is modeled as a skeleton of 3D Gaussians, one per joint, optimized via differentiable rendering to enable seamless fusion of arbitrary camera views without 3D ground-truth supervision. Since Gaussian Splatting was originally designed for dense scene reconstruction, we propose a novel one-hot encoding scheme that enables independent optimization of human joints. SkelSplat outperforms approaches that do not rely on 3D ground truth on Human3.6M and CMU Panoptic Studio, while reducing the cross-dataset error by up to 47.8% compared to learning-based methods. Experiments on Human3.6M-Occ and Occlusion-Person demonstrate robustness to occlusions, without scenario-specific fine-tuning. Our project page is available here: https://skelsplat.github.io.
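The one-hot idea can be illustrated with a simplified NumPy sketch: each joint is projected into a view and drawn as a 2D Gaussian in its own channel, so the loss for one joint never mixes with another. This is a stand-in under strong assumptions, not the SkelSplat renderer: it uses isotropic blobs with a fixed `sigma` instead of projected 3D covariances, skips alpha blending, and computes no gradients. All function names and parameters are hypothetical.

```python
import numpy as np


def project(points_3d, P):
    """Project (J, 3) points to (J, 2) pixels with a 3x4 camera matrix P."""
    homog = np.hstack([points_3d, np.ones((len(points_3d), 1))])
    uvw = homog @ P.T
    return uvw[:, :2] / uvw[:, 2:3]


def render_one_hot(centers_2d, height, width, sigma=2.0):
    """Draw one isotropic 2D Gaussian per joint, each in its own channel."""
    ys, xs = np.mgrid[0:height, 0:width]
    maps = np.zeros((len(centers_2d), height, width))
    for j, (u, v) in enumerate(centers_2d):
        maps[j] = np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2 * sigma ** 2))
    return maps


def heatmap_loss(rendered, target):
    """Per-joint mean squared error; entry j depends only on channel j."""
    return np.mean((rendered - target) ** 2, axis=(1, 2))
```

In this toy setup, the target stack would come from 2D detections in each view (the pseudo heatmaps), and summing `heatmap_loss` over views gives a multi-view objective in which moving one joint leaves the other joints' loss terms unchanged.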
