HUGS: Human Gaussian Splats

Muhammed Kocabas, Jen-Hao Rick Chang, James Gabriel, Oncel Tuzel, Anurag Ranjan

TL;DR

HUGS builds animatable human avatars in real-world scenes from short monocular videos (50-100 frames) by representing both the human and the scene with 3D Gaussian Splatting. It initializes Gaussians from SMPL and learns a canonical-space deformation with Linear Blend Skinning (LBS) weights, enabling efficient novel-pose and novel-view synthesis with 60 FPS rendering and ~30 minutes of training. The method achieves state-of-the-art reconstruction quality on NeuMan and ZJU-Mocap while training and rendering orders of magnitude faster than prior NeRF-based approaches. By explicitly disentangling human and scene Gaussians and using a lightweight triplane-MLP backbone, HUGS delivers fast, high-fidelity animated avatars suitable for in-the-wild monocular video data, with clear paths for extending clothing and illumination modeling in future work.

Abstract

Recent advances in neural rendering have improved both training and rendering times by orders of magnitude. While these methods demonstrate state-of-the-art quality and speed, they are designed for photogrammetry of static scenes and do not generalize well to freely moving humans in the environment. In this work, we introduce Human Gaussian Splats (HUGS), which represents an animatable human together with the scene using 3D Gaussian Splatting (3DGS). Our method takes only a monocular video with a small number of frames (50-100), and it automatically learns to disentangle the static scene and a fully animatable human avatar within 30 minutes. We utilize the SMPL body model to initialize the human Gaussians. To capture details that are not modeled by SMPL (e.g., clothing and hair), we allow the 3D Gaussians to deviate from the human body model. Utilizing 3D Gaussians for animated humans brings new challenges, including the artifacts created when articulating the Gaussians. We propose to jointly optimize the linear blend skinning weights to coordinate the movements of individual Gaussians during animation. Our approach enables novel-pose synthesis of the human and novel-view synthesis of both the human and the scene. We achieve state-of-the-art rendering quality with a rendering speed of 60 FPS while being ~100x faster to train than previous work. Our code will be announced here: https://github.com/apple/ml-hugs
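The abstract relies on linear blend skinning (LBS) to move each Gaussian with the skeleton. As a rough illustration only, the sketch below applies the standard LBS formula to a single Gaussian mean: the posed position is the weighted sum of the positions given by each joint's rigid transform. The function names (`rot_z`, `apply_rigid`, `lbs_point`), the two-joint setup, and the hand-picked weights are assumptions made for this example. In HUGS the weights are optimized jointly with the other Gaussian parameters, and the code here is not taken from the authors' implementation.

```python
import math

def rot_z(theta):
    """3x3 rotation about the z-axis by theta radians (row-major nested lists)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def apply_rigid(R, t, x):
    """Return R @ x + t for a 3x3 rotation R, translation t, and 3D point x."""
    return [sum(R[i][j] * x[j] for j in range(3)) + t[i] for i in range(3)]

def lbs_point(x, weights, transforms):
    """Linear blend skinning of one canonical-space point.

    x          : canonical 3D position (e.g. a Gaussian mean)
    weights    : K skinning weights that sum to 1 (hypothetical values here)
    transforms : K (R, t) rigid transforms, one per joint
    """
    if len(weights) != len(transforms):
        raise ValueError("need one weight per joint transform")
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("skinning weights must sum to 1")
    posed = [0.0, 0.0, 0.0]
    for w, (R, t) in zip(weights, transforms):
        p = apply_rigid(R, t, x)      # where this joint alone would move x
        for i in range(3):
            posed[i] += w * p[i]      # blend the per-joint results
    return posed

# Two illustrative joints: one stays fixed, one rotates 90 degrees about z.
identity = ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0, 0.0])
bend = (rot_z(math.pi / 2), [0.0, 0.0, 0.0])

mean = [1.0, 0.0, 0.0]
rigid = lbs_point(mean, [0.0, 1.0], [identity, bend])    # follows the bent joint only
blended = lbs_point(mean, [0.5, 0.5], [identity, bend])  # halfway between both joints
```

With equal weights the point lands midway between its two per-joint positions. This shows why poorly chosen weights produce visible artifacts near joints, and why the abstract proposes optimizing them rather than fixing them by hand.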

Paper Structure (37 sections, 5 equations, 12 figures, 4 tables)

Figures (12)

  • Figure 1: Human Gaussian Splats (HUGS) is a neural rendering framework that trains on 50-100 frames of a monocular video containing a human in a scene. HUGS enables novel view rendering with novel human poses at 60 FPS by learning a disentangled representation that can also render the human in other scenes.
  • Figure 2: HUGS overview. Given a video with dynamic human and camera motions, HUGS recovers an animatable human avatar and synthesizes the human and the scene from novel viewpoints. Our method represents both the human and the scene as 3D Gaussians. The human Gaussians are parameterized by their mean locations in a canonical space and the features from a triplane. Three MLPs are used to estimate their color, opacity, additional shift, rotation, scale, and LBS weights to animate the Gaussians with given joint configurations. The human and the scene Gaussians are combined and rendered together with splatting.
  • Figure 3: Qualitative results comparing HUGS (ours) with NeuMan and Vid2Avatar with full human (left) and zoomed-in regions (right) for each of the methods. HUGS shows better reconstruction quality especially around hands, feet and clothing wrinkles.
  • Figure 4: Rendering obtained by transferring the Human Gaussians to a different scene. Top-left corner shows the original scene in which the human was captured.
  • Figure 5: Visualization of the human in canonical Da-pose for HUGS (ours), showing qualitative improvements over NeuMan (Jiang et al., 2022).
  • ...and 7 more figures