GSAC: Leveraging Gaussian Splatting for Photorealistic Avatar Creation with Unity Integration

Rendong Zhang, Alexandra Watkins, Nilanjan Sarkar

TL;DR

This work introduces GSAC, an end-to-end Gaussian Splatting avatar pipeline that converts monocular video into a photorealistic, riggable SMPL-X-based avatar compatible with Unity. It combines a preprocessing stage (SMPL-X/DECA/mmpose-based estimation and missing-hand handling), a GS training stage (Gaussian splats bound to SMPL-X polygons with targeted losses), and a Unity Editor for real-time rendering and animation via GPU-based splats. The approach achieves faster preprocessing and competitive visual quality (PSNR/SSIM/LPIPS) with real-time performance in Unity (FPS > 60) and supports VR/AR application development, while acknowledging artifacts in unobserved regions and limitations in cloth dynamics. These findings demonstrate a practical, scalable path toward accessible, photorealistic, animatable avatars for immersive VR/AR experiences and interactive training scenarios.

Abstract

Photorealistic avatars have become essential for immersive applications in virtual reality (VR) and augmented reality (AR), enabling lifelike interactions in areas such as training simulations, telemedicine, and virtual collaboration. These avatars bridge the gap between the physical and digital worlds, improving the user experience through realistic human representation. However, existing avatar creation techniques face significant challenges, including high costs, long creation times, and limited utility in virtual applications. Manual methods, such as MetaHuman, require extensive time and expertise, while automatic approaches, such as NeRF-based pipelines, often lack efficiency and detailed facial expression fidelity and cannot be rendered fast enough for real-time applications. Combining several recent techniques, we introduce an end-to-end 3D Gaussian Splatting (3DGS) avatar creation pipeline that leverages monocular video input to create a scalable and efficient photorealistic avatar directly compatible with the Unity game engine. Our pipeline incorporates a novel Gaussian splatting technique with customized preprocessing that enables the use of "in the wild" monocular video capture, detailed facial expression reconstruction, and embedding within a fully rigged avatar model. Additionally, we present a Unity-integrated Gaussian Splatting Avatar Editor, offering a user-friendly environment for VR/AR application development. Experimental results validate the effectiveness of our preprocessing pipeline in standardizing custom data for 3DGS training and demonstrate the versatility of Gaussian avatars in Unity, highlighting the scalability and practicality of our approach.

Paper Structure

This paper contains 17 sections, 17 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Overview of the proposed method, which consists of three main stages: (1) Data Preprocessing: A 1080 × 1080 frame is processed using state-of-the-art models to optimize SMPL-X parameters. (2) Gaussian Splats Training: The top row illustrates the initialization of Gaussians, while the bottom row shows the rendered images during training. For each frame, the Gaussian splats are first deformed based on the SMPL-X parameters and then rasterized to obtain the rendered image. (3) Unity Editor Viewing: The trained avatar is initially in a T-pose, with an A-pose option provided for user-driven animation.
  • Figure 2: Comparison of rendered results for each subject. (a) Full-body visualizations show that both our method and HAHA produce shapes closely matching the ground truth. (b) Cropped views of facial expressions demonstrate that our method achieves higher fidelity in capturing facial details compared to HAHA.
  • Figure 3: Qualitative results on a volunteer's video input, illustrating key steps of our pipeline. (a) Image frame resized to 1080 × 1080 from an input frame captured using an iPhone 12 Pro (1440 × 1440 resolution). (b) Visualization of estimated SMPL-X parameters. (c) Initialized Gaussians. (d) Gaussian rendering during training. (e) Final trained Gaussian splats avatar in Unity, presented in an A-pose.
  • Figure 4: Qualitative results of our Unity viewer, demonstrating three different poses—T-pose, A-pose, and a novel pose with a simple facial expression—for (a) male4 and (b) female3 from the PeopleSnapshot dataset.
  • Figure 5: Demonstration of Gaussian avatars animated with Unity’s default animation system. (a) Avatar of Male4 animated with selected frames of a dance animation obtained from the Unity Asset Store. (b) Avatar of Female4 animated with selected frames of another dance animation.
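
The Figure 1 caption describes deforming the Gaussian splats according to the SMPL-X parameters before rasterizing them, and the TL;DR notes that splats are bound to SMPL-X polygons. As a rough illustration of what such a binding can look like, the sketch below attaches a splat center to a single triangle through a local frame, so that the center follows the triangle when the mesh is posed. This is not the paper's implementation: the frame construction (centroid origin, edge-aligned x-axis, normal as z-axis, mean edge length as size) and all function names are assumptions chosen for illustration, and the triangle vertices are hard-coded rather than produced by an SMPL-X model.

```python
import math

# Small 3-vector helpers (plain tuples keep the sketch dependency-free).
def add(a, b): return tuple(x + y for x, y in zip(a, b))
def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def scale(a, s): return tuple(x * s for x in a)
def norm(a): return math.sqrt(sum(x * x for x in a))
def unit(a): return scale(a, 1.0 / norm(a))
def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def triangle_frame(v0, v1, v2):
    """Hypothetical local frame of one mesh triangle: (origin, axes, size)."""
    origin = scale(add(add(v0, v1), v2), 1.0 / 3.0)   # centroid
    x_axis = unit(sub(v1, v0))                         # along the first edge
    z_axis = unit(cross(sub(v1, v0), sub(v2, v0)))     # triangle normal
    y_axis = cross(z_axis, x_axis)                     # completes the basis
    # Mean edge length, used so offsets grow and shrink with the triangle.
    size = (norm(sub(v1, v0)) + norm(sub(v2, v1)) + norm(sub(v0, v2))) / 3.0
    return origin, (x_axis, y_axis, z_axis), size

def splat_center_world(local_center, frame):
    """Map a splat center stored in triangle-local coordinates to world space."""
    origin, axes, size = frame
    world = origin
    for coord, axis in zip(local_center, axes):
        world = add(world, scale(axis, coord * size))
    return world

# Usage: the same local offset evaluated on a rest triangle and a posed one.
local = (0.0, 0.0, 0.1)  # slightly above the triangle surface
rest = triangle_frame((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
posed = triangle_frame((0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 1.0, 2.0))
print(splat_center_world(local, rest))
print(splat_center_world(local, posed))
```

Because only the triangle-local offset is stored, re-posing the mesh moves the splat without retraining. A fuller version would transform the splat's rotation and scale with the same frame.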