SkelSplat: Robust Multi-view 3D Human Pose Estimation with Differentiable Gaussian Rendering

Laura Bragagnolo, Leonardo Barcellona, Stefano Ghidoni

TL;DR

SkelSplat reframes multi-view 3D human pose estimation as differentiable Gaussian rendering: a skeleton of Gaussians, one per joint, is supervised by pseudo 2D heatmaps generated from multi-view detections. A one-hot joint encoding enables independent joint optimization within the Gaussian Splatting framework, while cross-view gradient accumulation and a 3D symmetry regularizer improve stability and anatomical coherence without requiring 3D ground-truth data. The method demonstrates state-of-the-art occlusion robustness and strong cross-dataset generalization across the Human3.6M, CMU Panoptic Studio, and Occlusion-Person datasets, which makes it practical for diverse deployments without retraining. However, it gives up real-time speed in exchange for accuracy and scalability, motivating future work on faster rendering schemes and a multi-person extension that preserve its generalization advantages.

Abstract

Accurate 3D human pose estimation is fundamental for applications such as augmented reality and human-robot interaction. State-of-the-art multi-view methods learn to fuse predictions across views by training on large annotated datasets, leading to poor generalization when the test scenario differs. To overcome these limitations, we propose SkelSplat, a novel framework for multi-view 3D human pose estimation based on differentiable Gaussian rendering. Human pose is modeled as a skeleton of 3D Gaussians, one per joint, optimized via differentiable rendering to enable seamless fusion of arbitrary camera views without 3D ground-truth supervision. Since Gaussian Splatting was originally designed for dense scene reconstruction, we propose a novel one-hot encoding scheme that enables independent optimization of human joints. SkelSplat outperforms approaches that do not rely on 3D ground truth on Human3.6M and CMU Panoptic Studio, while reducing the cross-dataset error by up to 47.8% compared to learning-based methods. Experiments on Human3.6M-Occ and Occlusion-Person demonstrate robustness to occlusions, without scenario-specific fine-tuning. Our project page is available here: https://skelsplat.github.io.
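
The TL;DR names a 3D symmetry regularizer but this page does not give its form. As a rough illustration only, one plausible version penalizes the difference in length between mirrored left and right bones. The joint names, the bone pairs, and the squared-difference penalty below are all assumptions made for this sketch, not the authors' formulation.

```python
import math

# Hypothetical left/right bone pairs; each bone is a (parent, child) joint pair.
SYMMETRIC_BONES = [
    (("l_hip", "l_knee"), ("r_hip", "r_knee")),
    (("l_shoulder", "l_elbow"), ("r_shoulder", "r_elbow")),
]


def bone_length(pose, bone):
    """Euclidean distance between the two 3D joints of a bone."""
    parent, child = bone
    return math.dist(pose[parent], pose[child])


def symmetry_penalty(pose, pairs=SYMMETRIC_BONES):
    """Sum of squared length differences between mirrored bones (assumed form)."""
    return sum(
        (bone_length(pose, left) - bone_length(pose, right)) ** 2
        for left, right in pairs
    )


# Toy pose in metres; the coordinates are made up for illustration.
symmetric_pose = {
    "l_hip": (0.1, 0.0, 0.0), "l_knee": (0.1, -0.4, 0.0),
    "r_hip": (-0.1, 0.0, 0.0), "r_knee": (-0.1, -0.4, 0.0),
    "l_shoulder": (0.2, 0.5, 0.0), "l_elbow": (0.2, 0.2, 0.0),
    "r_shoulder": (-0.2, 0.5, 0.0), "r_elbow": (-0.2, 0.2, 0.0),
}
# Same pose with the right knee displaced, so the right thigh is 0.1 m longer.
asymmetric_pose = dict(symmetric_pose, r_knee=(-0.1, -0.5, 0.0))

penalty_symmetric = symmetry_penalty(symmetric_pose)
penalty_asymmetric = symmetry_penalty(asymmetric_pose)
```

A penalty of this kind is zero for a left-right consistent skeleton and grows when one limb drifts, which is the sort of anatomical prior that needs no 3D ground truth.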

Paper Structure

This paper contains 43 sections, 9 equations, 9 figures, 13 tables.

Figures (9)

  • Figure 1: SkelSplat 3D Gaussian joints are optimized (red) by aligning renderings with 2D detection heatmaps (green arrows).
  • Figure 2: Overview of the SkelSplat framework. Given multi-view images and 2D pose detections, we initialize a skeleton of 3D Gaussians, one per human joint. Pseudo ground truth heatmaps are generated from the 2D detections and used to supervise the optimization, which refines the Gaussians by minimizing a differentiable loss between heatmaps and Gaussian renderings.
  • Figure 3: One-hot encoding: each joint is rendered to its own channel.
  • Figure 4: SkelSplat results (blue) on Human3.6M-Occ-3 with ground-truth poses (green). The rightmost column shows a failure case under occlusion, where the left knee is incorrectly predicted.
  • Figure 5: Ablation on robustness to poor initialization, adding Gaussian noise to triangulated joints.
  • ...and 4 more figures
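
The ideas in the captions of Figures 2 and 3 (pseudo ground-truth heatmaps built from 2D detections, and one output channel per joint) can be illustrated with a small sketch. It is a simplification under stated assumptions: joints are treated as already projected to 2D, each is drawn as an isotropic Gaussian, and the per-channel mean squared error is a hypothetical loss choice. The function names, grid size, and sigma are invented for this example; the actual method renders 3D Gaussians through a differentiable splatting pipeline.

```python
import math


def gaussian_heatmap(center, size, sigma):
    """Isotropic 2D Gaussian peaked at `center` = (x, y) on a size x size grid."""
    cx, cy = center
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(size)]
        for y in range(size)
    ]


def render_one_hot(joints_2d, size, sigma):
    """One channel per joint: joint j contributes only to channel j."""
    return [gaussian_heatmap(p, size, sigma) for p in joints_2d]


def channel_losses(rendered, target):
    """Mean squared error computed separately for each joint channel."""
    losses = []
    for r_chan, t_chan in zip(rendered, target):
        n_pixels = len(r_chan) * len(r_chan[0])
        sq_err = sum(
            (a - b) ** 2
            for r_row, t_row in zip(r_chan, t_chan)
            for a, b in zip(r_row, t_row)
        )
        losses.append(sq_err / n_pixels)
    return losses


# Hypothetical 2D detections for two joints in one view.
detections = [(8.0, 8.0), (20.0, 12.0)]
target = render_one_hot(detections, size=32, sigma=2.0)  # pseudo ground truth

# Current estimate: joint 0 is already aligned, joint 1 is 3 pixels off.
estimate = [(8.0, 8.0), (17.0, 12.0)]
losses = channel_losses(render_one_hot(estimate, size=32, sigma=2.0), target)
```

Because each joint has its own channel, the error on joint 1 leaves the loss of joint 0 at zero, so a misplaced joint cannot pull a correctly placed one away from its detection.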