Splatography: Sparse multi-view dynamic Gaussian Splatting for filmmaking challenges

Adrian Azzarelli, Nantheera Anantrasirichai, David R Bull

TL;DR

Sparse camera configurations in filmmaking hamper robust dynamic 3-D reconstruction, especially for reflective, transparent, and dynamically textured (RTD) content. The paper introduces a foreground-background disentangled dynamic Gaussian Splatting framework that splits the canonical Gaussians into $G_f$ and $G_b$ using a sparse set of masks at $t=0$, learns separate hex-plane deformation fields $\Lambda_f$ and $\Lambda_b$, and employs a modified opacity model to capture dynamic textures alongside a reference-free densification strategy. Key contributions include mask-based canonical initialization; dual deformation fields aligned with filmmaking practices (position changes only for the background; position, rotation, and color changes for the foreground); an opacity-based mechanism for RTD textures; and a densification scheme that reduces background bias and preserves foreground fidelity. Experiments on sparse-view 3-D and 2.5-D entertainment datasets show SotA qualitative and quantitative gains, up to 3 PSNR higher with about half the model size on 3-D scenes. The method also enables clean foreground segmentation, including transparent textures, for post-production workflows.
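
The mask-based canonical split can be illustrated with a minimal sketch. It is not the authors' implementation: the `MaskedView` class, the translation-only pinhole camera, and the majority-vote rule (`min_ratio`) are all assumptions made for illustration. The summary only states that the initial point cloud is masked using a sparse set of masks at $t=0$.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


@dataclass
class MaskedView:
    """Hypothetical pinhole camera with a binary foreground mask at t=0.

    Assumption: the camera looks down +z and has no rotation, only an origin.
    """
    origin: Point
    focal: float
    cx: float
    cy: float
    mask: List[List[int]]  # mask[row][col], 1 = foreground pixel

    def project(self, p: Point) -> Optional[Tuple[int, int]]:
        """Return the (col, row) pixel for p, or None if p is not visible."""
        x, y, z = (p[i] - self.origin[i] for i in range(3))
        if z <= 0:
            return None  # behind the camera
        u = int(round(self.focal * x / z + self.cx))
        v = int(round(self.focal * y / z + self.cy))
        if 0 <= v < len(self.mask) and 0 <= u < len(self.mask[0]):
            return u, v
        return None  # outside the image


def split_points(points: Sequence[Point], views: Sequence[MaskedView],
                 min_ratio: float = 0.5) -> Tuple[List[Point], List[Point]]:
    """Split points into (foreground, background) by voting over masked views.

    A point is foreground when more than min_ratio of the views that see it
    place it on a foreground pixel. Points seen by no view go to background.
    """
    foreground, background = [], []
    for p in points:
        seen = hits = 0
        for view in views:
            uv = view.project(p)
            if uv is None:
                continue
            seen += 1
            hits += view.mask[uv[1]][uv[0]]
        if seen > 0 and hits / seen > min_ratio:
            foreground.append(p)
        else:
            background.append(p)
    return foreground, background
```

In this sketch the two returned lists would seed $G_f$ and $G_b$ respectively. Sending unseen points to the background is a conservative choice that keeps the foreground set clean.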

Abstract

Deformable Gaussian Splatting (GS) accomplishes photorealistic dynamic 3-D reconstruction from dense multi-view video (MVV) by learning to deform a canonical GS representation. However, in filmmaking, tight budgets can result in sparse camera configurations, which limit state-of-the-art (SotA) methods when capturing complex dynamic features. To address this issue, we introduce an approach that splits the canonical Gaussians and deformation field into foreground and background components using a sparse set of masks for frames at t=0. Each representation is trained separately with different loss functions during canonical pre-training. Then, during dynamic training, different parameters are modeled for each deformation field following common filmmaking practices. The foreground stage contains diverse dynamic features, so changes in color, position and rotation are learned. The background, which contains film crew and equipment, is typically dimmer and less dynamic, so only changes in point position are learned. Experiments on 3-D and 2.5-D entertainment datasets show that our method produces SotA qualitative and quantitative results: up to 3 PSNR higher with half the model size on 3-D scenes. Unlike the SotA, and without the need for dense mask supervision, our method also produces segmented dynamic reconstructions, including transparent and dynamic textures. Code and video comparisons are available online: https://interims-git.github.io/
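
The asymmetric treatment of the two deformation fields can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the `Gaussian` and `Delta` containers, the `deform` function, and the additive form of the rotation and color offsets are hypothetical, and the hex-plane fields that would produce the offsets are omitted.

```python
from dataclasses import dataclass, replace
from typing import Tuple

Vec3 = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]  # (w, x, y, z)


@dataclass(frozen=True)
class Gaussian:
    """Canonical per-point state (hypothetical, reduced to three attributes)."""
    position: Vec3
    rotation: Quat
    color: Vec3


@dataclass(frozen=True)
class Delta:
    """Offsets a deformation field might predict for one point at time t."""
    d_position: Vec3
    d_rotation: Quat
    d_color: Vec3


def _add(a, b):
    """Element-wise sum of two equal-length tuples."""
    return tuple(x + y for x, y in zip(a, b))


def deform(g: Gaussian, delta: Delta, foreground: bool) -> Gaussian:
    """Apply a predicted offset to a canonical Gaussian.

    Background points only move. Foreground points also change rotation
    and color. Assumption: all offsets are additive.
    """
    position = _add(g.position, delta.d_position)
    if not foreground:
        # Rotation and color offsets are ignored for the background.
        return replace(g, position=position)
    return Gaussian(
        position=position,
        rotation=_add(g.rotation, delta.d_rotation),
        color=_add(g.color, delta.d_color),
    )
```

Restricting the background to position offsets gives it fewer time-varying parameters to learn, which matches the stated motivation that the background is dimmer and less dynamic.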

Paper Structure

This paper contains 24 sections, 7 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Sparse-view 3-D reconstruction: Our dynamic representation offers foreground-background separability and high-quality 3-D reconstruction without the need for dense mask priors. This paper focuses on filmmaking challenges, including but not limited to sparse views and reflective, transparent and dynamic textures.
  • Figure 2: Novel views (right) reveal over-reconstructed backgrounds and under-reconstructed foregrounds in SotA methods.
  • Figure 3: Left: The canonical representation is constructed by masking the initial point cloud and training the foreground and background representations $G_f$ and $G_b$ on specialized loss functions that minimize over-reconstruction. Right: Dynamic features for $G_f$ and $G_b$ are jointly trained using the proposed plane-based design. For $G_b$, we only learn motion. For $G_f$, we learn motion, rotation and color change using a novel combination of plane features, and also temporal opacity using an exponential peaking function.
  • Figure 4: Full and Zoom Temporal Comparison: The zoom results show that our method is the only one capable of capturing the visual dynamics of the semi-transparent key-chain. Shown on the ViVo-Bassist scene [azzarelli2025vivo].
  • Figure 5: Per-Frame and Per-View PSNR Plot: The surrounding plots show the PSNR results and objectively demonstrate that our approach performs consistently. Full and Zoom Frame Comparison: Our-Foreground (labeled Ours) reconstructs the keyboard, arms and feet with more visual appeal. Shown on the ViVo-Pianist scene [azzarelli2025vivo].
  • ...and 2 more figures