Detail Enhanced Gaussian Splatting for Large-Scale Volumetric Capture

Julien Philip, Li Ma, Pascal Clausen, Wenqi Xian, Ahmet Levent Taşel, Mingming He, Xueming Yu, David M. George, Ning Yu, Oliver Pilarski, Paul Debevec

TL;DR

This work introduces a two-rig, large-scale 4D volumetric capture pipeline combining Poly4DGS dynamic Gaussian splatting with a diffusion-based detail enhancement to produce production-quality 4K facial closeups. It tackles the gap between scalable scene capture and high-resolution rendering by (i) capturing multi-actor performances with a Scene Rig, (ii) capturing actor-specific facial detail with a Face Rig, and (iii) training a diffusion model on paired low/high-quality GS data to add fine details and restore alpha. The approach achieves improved temporal stability and detail fidelity, validated through ablations and comparisons, enabling realistic free-viewpoint video suitable for film and television production. However, the method requires substantial hardware and computing resources, and some temporal artifacts persist, motivating future work in relighting and improved eye reflections.

Abstract

We present a unique system for large-scale, multi-performer, high-resolution 4D volumetric capture providing realistic free-viewpoint video up to and including 4K-resolution facial closeups. To achieve this, we employ a novel volumetric capture, reconstruction and rendering pipeline based on Dynamic Gaussian Splatting and Diffusion-based Detail Enhancement. We design our pipeline specifically to meet the demands of high-end media production. We employ two capture rigs: the Scene Rig, which captures multi-actor performances at a resolution that falls short of 4K production quality, and the Face Rig, which records high-fidelity single-actor facial detail to serve as a reference for detail enhancement. We first reconstruct dynamic performances from the Scene Rig using 4D Gaussian Splatting, incorporating new model designs and training strategies to improve reconstruction, dynamic range, and rendering quality. Then, to render high-quality images for facial closeups, we introduce a diffusion-based detail enhancement model. This model is fine-tuned with high-fidelity data from the same actors recorded in the Face Rig. We train on paired data generated from low- and high-quality Gaussian Splatting (GS) models, using the low-quality input to match the quality of the Scene Rig, with the high-quality GS as ground truth. Our results demonstrate the effectiveness of this pipeline in bridging the gap between the scalable performance capture of a large-scale rig and the high-resolution standards required for film and media production.
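
The paired-data idea in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the `(frame, camera)` keying, the `TrainingPair` type, the `build_pairs` helper, and the use of file paths to stand in for renders are all assumptions made for the example. It only shows how low-quality GS renders (network inputs) could be matched to high-quality GS renders (ground truth) of the same frame and viewpoint.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical key: (frame index, camera id). Not taken from the paper.
RenderKey = Tuple[int, str]


@dataclass(frozen=True)
class TrainingPair:
    key: RenderKey
    lq_render: str  # render from the low-quality GS model (network input)
    hq_render: str  # render from the high-quality GS model (ground truth)


def build_pairs(lq: Dict[RenderKey, str], hq: Dict[RenderKey, str]) -> List[TrainingPair]:
    """Pair LQ and HQ renders that share the same frame and camera.

    Keys present in only one of the two sets are skipped, because a
    training pair needs both an input and a supervision target.
    """
    shared = sorted(lq.keys() & hq.keys())
    return [TrainingPair(k, lq[k], hq[k]) for k in shared]


if __name__ == "__main__":
    lq = {(0, "cam_a"): "lq/0_a.png", (1, "cam_a"): "lq/1_a.png"}
    hq = {(0, "cam_a"): "hq/0_a.png", (2, "cam_a"): "hq/2_a.png"}
    for pair in build_pairs(lq, hq):
        print(pair.key, pair.lq_render, "->", pair.hq_render)
```

Matching strictly on frame and viewpoint keeps the input and target pixel-aligned, so the enhancement model only has to learn the quality gap rather than a change in pose or time.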

Paper Structure

This paper contains 30 sections, 6 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Pipeline Overview. Actors perform in the Scene Rig, where full-body performances are captured. Using our Poly4DGS framework, we reconstruct the performance. The same actors are then captured in the Face Rig. We generate Poly4DGS models for a portion of their facial performance: a high-quality model (HQGS, 4M Gaussians) and a low-quality model (LQGS, 50K-200K Gaussians). These reconstructions are used to train an Image Enhancement Module which refines the renderings of the low-quality GS to match those of the high-quality one. Finally, the trained model is used to enhance renderings from the 4DGS performance. Please refer to our supplementary video for the final composition of 4K render results.
  • Figure 2: The Scene Rig captures suffer from severe lens glare for some cameras.
  • Figure 3: Architectural changes made to the base Flux model. Starting from the Latent Diffusion Architecture (top left), we add input channels to condition the network (bottom left). To improve temporal stability and generate an alpha channel, we condition our model on the previous warped output, a validity mask, and both the LQ RGB and Alpha. We also double the size of the latent space to predict RGB and Alpha jointly (right).
  • Figure 4: Illustration of the Face Rig and corresponding reconstructions. From left to right: the Face Rig hardware, an input image, a low-quality GS render and the corresponding high-quality GS render used for supervision.
  • Figure 5: Top: input training views captured in our Scene Rig. Bottom-left: our Poly4DGS reconstruction with insets. Bottom-right: final results using our super-resolution module and compositing. We can observe that the Gaussian artifacts present at extreme zoom levels are effectively removed and new details added.
  • ...and 2 more figures
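
The conditioning inputs listed in the Figure 3 caption (LQ RGB, LQ alpha, previous warped output, validity mask) can be pictured as extra channels stacked onto the network input. The sketch below only does the channel bookkeeping. The channel counts, the ordering, the assumption that the previous output carries RGBA, and the names `CONDITIONING_CHANNELS`, `channel_slices`, and `total_channels` are all illustrative assumptions; the caption gives none of these, and the actual model may apply the conditioning in latent space instead.

```python
from typing import Dict, List, Tuple

# Assumed layout: (signal name, channel count). Counts are illustrative only.
CONDITIONING_CHANNELS: List[Tuple[str, int]] = [
    ("lq_rgb", 3),              # low-quality GS render, colour
    ("lq_alpha", 1),            # low-quality GS render, alpha
    ("prev_warped_output", 4),  # previous enhanced frame warped to this one (assumed RGBA)
    ("validity_mask", 1),       # marks where the warped previous output is usable
]


def channel_slices(layout: List[Tuple[str, int]]) -> Dict[str, Tuple[int, int]]:
    """Return the half-open [start, stop) channel range of each signal."""
    slices: Dict[str, Tuple[int, int]] = {}
    start = 0
    for name, width in layout:
        if width <= 0:
            raise ValueError(f"channel count for {name!r} must be positive")
        slices[name] = (start, start + width)
        start += width
    return slices


def total_channels(layout: List[Tuple[str, int]]) -> int:
    """Total number of conditioning channels stacked onto the input."""
    return sum(width for _, width in layout)


if __name__ == "__main__":
    for name, (start, stop) in channel_slices(CONDITIONING_CHANNELS).items():
        print(f"{name}: channels {start}..{stop - 1}")
    print("total:", total_channels(CONDITIONING_CHANNELS))
```

Keeping the layout in one table makes it easy to see which slice carries the temporal signal (the warped previous output and its validity mask) and which carries the per-frame LQ render.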