GST: Precise 3D Human Body from a Single Image with Gaussian Splatting Transformers

Lorenza Prospero, Abdullah Hamdi, Joao F. Henriques, Christian Rupprecht

TL;DR

GST is a diffusion-free, Transformer-based pipeline that reconstructs a 3D human body from a single image by predicting SMPL parameters together with a set of 3D Gaussians anchored to the SMPL vertices. Multi-view image supervision replaces expensive 3D ground truth and diffusion priors, and inference runs in near real time, which suits sports applications. Pose and appearance are optimized jointly with a loss that combines image reconstruction, perceptual similarity, and a tightness regularizer that keeps the Gaussians aligned with the SMPL mesh. Experiments on multiple datasets show accurate 3D shape estimation, competitive novel-view synthesis, and scaling to a large, diverse set of subjects (TH21), which points to practical use in sports performance analysis and training.
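The three loss terms named above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `gst_style_loss`, the weights `w_perc` and `w_tight`, the use of mean squared error for the reconstruction term, and the squared-offset form of the tightness term are all assumptions. The perceptual term is left as a user-supplied callable.

```python
import numpy as np

def gst_style_loss(rendered, target, offsets, perceptual_fn=None,
                   w_perc=0.1, w_tight=0.01):
    """Hypothetical combined loss: reconstruction + perceptual + tightness.

    rendered, target: arrays of shape (H, W, 3) with values in [0, 1].
    offsets: array of shape (N, 3), per-Gaussian displacement from its SMPL vertex.
    perceptual_fn: optional callable (rendered, target) -> float (assumed, e.g. an
        LPIPS-like metric); contributes zero when omitted.
    """
    rendered = np.asarray(rendered, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    offsets = np.asarray(offsets, dtype=np.float64)

    # Image reconstruction term: mean squared error over all pixels (assumed form).
    recon = np.mean((rendered - target) ** 2)

    # Perceptual similarity term: delegated to the supplied callable.
    perc = float(perceptual_fn(rendered, target)) if perceptual_fn is not None else 0.0

    # Tightness term: mean squared offset length, which discourages Gaussians
    # from drifting far from the SMPL vertices (assumed form).
    tight = np.mean(np.sum(offsets ** 2, axis=-1))

    return recon + w_perc * perc + w_tight * tight
```

In a real training loop the same terms would be written in a differentiable framework and averaged over the supervised views; NumPy is used here only to keep the sketch self-contained.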

Abstract

Reconstructing posed 3D human models from monocular images has important applications in the sports industry, including performance tracking, injury prevention and virtual training. In this work, we combine 3D human pose and shape estimation with 3D Gaussian Splatting (3DGS), a representation of the scene composed of a mixture of Gaussians. This allows training or fine-tuning a human model predictor on multi-view images alone, without 3D ground truth. Predicting such mixtures for a human from a single input image is challenging due to self-occlusions and dependence on articulations, and the predictor must also retain enough flexibility to accommodate a variety of clothes and poses. Our key observation is that the vertices of standardized human meshes (such as SMPL) can provide an adequate spatial density and approximate initial position for the Gaussians. We can then train a transformer model to jointly predict comparatively small adjustments to these positions, as well as the other 3DGS attributes and the SMPL parameters. We show empirically that this combination (using only multi-view supervision) can achieve near real-time inference of 3D human models from a single image without expensive diffusion models or 3D point supervision, thus making it well suited to the sports industry at any level. More importantly, rendering is an effective auxiliary objective to refine 3D pose estimation by accounting for clothes and other geometric variations. The code is available at https://github.com/prosperolo/GST.
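The key observation, that SMPL vertices supply initial Gaussian positions which are then adjusted by small predicted offsets, can be illustrated with a short sketch. The function name `anchor_gaussians`, the `max_offset` bound, and the `tanh` squashing are illustrative assumptions, not details taken from the paper; only the relation "position = vertex + offset" and the SMPL vertex count come from the source and the SMPL model itself.

```python
import numpy as np

NUM_SMPL_VERTICES = 6890  # vertex count of the standard SMPL mesh

def anchor_gaussians(smpl_vertices, raw_offsets, max_offset=0.05):
    """Place one Gaussian per SMPL vertex: mu = v + delta.

    smpl_vertices: (N, 3) posed SMPL vertex positions.
    raw_offsets:   (N, 3) unbounded network outputs.
    max_offset:    assumed bound (in mesh units) that keeps adjustments small.
    """
    smpl_vertices = np.asarray(smpl_vertices, dtype=np.float64)
    raw_offsets = np.asarray(raw_offsets, dtype=np.float64)
    if smpl_vertices.shape != raw_offsets.shape:
        raise ValueError("vertices and offsets must have the same shape")

    # Squash raw outputs into [-max_offset, max_offset] per coordinate
    # (an assumption made here to model "comparatively small adjustments").
    delta = max_offset * np.tanh(raw_offsets)

    # Each Gaussian mean is its anchor vertex plus the bounded offset.
    return smpl_vertices + delta
```

Because every Gaussian is tied to a vertex, reposing the SMPL mesh moves the Gaussians with it, and the offsets only need to account for clothing and other deviations from the body surface.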

Paper Structure (19 sections, 4 equations, 15 figures, 6 tables)

Figures (15)

  • Figure 1: Example of human pose improvements using our method GST. 3D human body results of our GST and SMPL predictions of HMR2 [4dhumans] on a sports sequence from the CMU panoptic dome dataset [cmu].
  • Figure 2: Overview of the pipeline of Gaussian Splatting Transformer (GST). Given a single input image, GST uses a Vision Transformer (ViT) to predict both the 3D human pose (in the form of SMPL parameters) and a refined full-color 3D model (in the form of 3D Gaussian Splats). Additional input tokens are used to predict each Gaussian color $\mathbf{c}$, opacity $\alpha$, scale, rotation, and position offset $\boldsymbol{\delta}$. Every Gaussian position $\boldsymbol{\mu}$ is tied to one vertex of the SMPL model $\mathbf{v}$ by the offset $\boldsymbol{\delta}$.
  • Figure 3: 3D Shape Comparison with HMR2. 3D human body results of our GST on two subjects of the HuMMan dataset [humman] compared to ground-truth SMPL parameters [SMPL] and SMPL predictions of HMR2 [4dhumans].
  • Figure 4: 3D Shape Comparison with HMR2 After Fine-tuning on 2D and 3D Data. 3D human body results of our GST on two subjects of the HuMMan dataset [humman] compared to ground-truth SMPL parameters [SMPL] and SMPL predictions of HMR2 [4dhumans]. We show two versions of HMR2, one fine-tuned on 2D data only (HMR2-2D) and one fine-tuned on 3D data (HMR2-3D). Our method is fine-tuned on 2D image data only, but the results are almost as accurate as HMR2 fine-tuned on 3D data.
  • Figure 5: Single Image NVS. Results of GST on two subjects of the HuMMan dataset [humman] compared to ground-truth renderings and SHERF [SHERF] (after being adapted with HMR2 to work with single image input only). GST depicts the correct human pose (compared with ground truth).
  • ...and 10 more figures
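
The per-Gaussian attributes listed in the Figure 2 caption (color, opacity, scale, rotation, position offset) can be pictured as slices of one output vector per Gaussian. The sketch below is hypothetical: the 14-value layout, the function name `decode_gaussian_attributes`, and the activations (sigmoid for color and opacity, exponential for scale, normalized quaternion for rotation) follow common 3D Gaussian Splatting practice and are assumptions here, not details confirmed by the caption.

```python
import numpy as np

def decode_gaussian_attributes(raw):
    """Split an (N, 14) array into per-Gaussian attributes (hypothetical layout).

    Columns: 0-2 color, 3 opacity, 4-6 scale, 7-10 rotation quaternion, 11-13 offset.
    """
    raw = np.asarray(raw, dtype=np.float64)
    if raw.ndim != 2 or raw.shape[1] != 14:
        raise ValueError("expected an array of shape (N, 14)")

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    color = sigmoid(raw[:, 0:3])      # RGB color c, each channel in (0, 1)
    opacity = sigmoid(raw[:, 3:4])    # opacity alpha in (0, 1)
    scale = np.exp(raw[:, 4:7])       # strictly positive per-axis scale

    # Normalize to a unit quaternion so it represents a valid rotation.
    quat = raw[:, 7:11]
    norm = np.linalg.norm(quat, axis=1, keepdims=True)
    rotation = quat / np.maximum(norm, 1e-8)

    offset = raw[:, 11:14]            # position offset delta, added to the SMPL vertex v

    return {"color": color, "opacity": opacity, "scale": scale,
            "rotation": rotation, "offset": offset}
```

The Gaussian position then follows the relation stated in the caption, $\boldsymbol{\mu} = \mathbf{v} + \boldsymbol{\delta}$, with one Gaussian per SMPL vertex.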