Swift4D: Adaptive divide-and-conquer Gaussian Splatting for compact and efficient reconstruction of dynamic scene

Jiahao Wu, Rui Peng, Zhiyan Wang, Lu Xiao, Luyang Tang, Jinbo Yan, Kaiqiang Xiong, Ronggang Wang

TL;DR

Swift4D tackles dynamic scene novel view synthesis by dividing Gaussian splats into dynamic and static components and applying temporal modeling only to the dynamic subset. A compact 4DHash-based spatio-temporal encoder paired with a multi-head deformation decoder models deformations of dynamic Gaussians, while temporal pruning removes floaters and mitigates coupling between canonical and deformation spaces. The method achieves state-of-the-art rendering quality with significantly reduced training time (often minutes) and storage (as low as $30$ MB) on real-world datasets, demonstrating fast convergence and practicality for dynamic scenes. The approach offers a plug-and-play module for existing dynamic methods and emphasizes efficient allocation of compute to genuinely dynamic regions, enabling scalable 4D reconstruction.
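As a rough illustration of what a multi-head deformation decoder could look like, the sketch below maps a per-Gaussian spatio-temporal feature through a shared layer and then through separate linear heads. The choice of heads (position, rotation, and scale offsets), the layer sizes, and the class name `MultiHeadDeformationDecoder` are assumptions made for illustration; the summary above does not specify them, and the weights here are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_linear(in_dim, out_dim):
    """Return (weight, bias) for one linear layer with a small random init."""
    return rng.normal(scale=0.01, size=(in_dim, out_dim)), np.zeros(out_dim)


class MultiHeadDeformationDecoder:
    """Hypothetical decoder: shared trunk followed by one linear head per attribute."""

    def __init__(self, feat_dim, hidden_dim=32):
        self.w_trunk, self.b_trunk = make_linear(feat_dim, hidden_dim)
        # Assumed heads: offsets for position (3), rotation quaternion (4), scale (3).
        self.heads = {
            "d_xyz": make_linear(hidden_dim, 3),
            "d_rot": make_linear(hidden_dim, 4),
            "d_scale": make_linear(hidden_dim, 3),
        }

    def __call__(self, feats):
        # Shared ReLU layer, then each head predicts its own offset.
        hidden = np.maximum(feats @ self.w_trunk + self.b_trunk, 0.0)
        return {name: hidden @ w + b for name, (w, b) in self.heads.items()}


decoder = MultiHeadDeformationDecoder(feat_dim=16)
offsets = decoder(np.ones((5, 16)))  # features for 5 dynamic Gaussians
```

Each entry of `offsets` has one row per dynamic Gaussian; in a setup like this the offsets would be added to the canonical attributes of the dynamic subset only.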

Abstract

Novel view synthesis has long been a practical but challenging task. Although numerous methods have been introduced to solve this problem, even those combining advanced representations like 3D Gaussian Splatting still struggle to recover high-quality results and often consume too much storage and training time. In this paper we propose Swift4D, a divide-and-conquer 3D Gaussian Splatting method that handles static and dynamic primitives separately, achieving a good trade-off between rendering quality and efficiency. It is motivated by the fact that most of the scene consists of static primitives that do not require additional dynamic properties. Concretely, we model dynamic transformations only for the dynamic primitives, which benefits both efficiency and quality. We first employ a learnable decomposition strategy to separate the primitives, which relies on an additional parameter to classify primitives as static or dynamic. For the dynamic primitives, we employ a compact multi-resolution 4D Hash mapper to transform these primitives from canonical space into deformation space at each timestamp, and then mix the static and dynamic primitives to produce the final output. This divide-and-conquer method facilitates efficient training and reduces storage redundancy. Our method achieves state-of-the-art rendering quality while training 20X faster than previous SOTA methods, with a minimum storage requirement of only 30 MB on real-world datasets. Code is available at https://github.com/WuJH2001/swift4d.
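To make the idea of a multi-resolution 4D hash mapper more concrete, here is a minimal sketch of a spatio-temporal hash lookup. It is not the authors' implementation: it uses a nearest-cell lookup instead of interpolation, and the mixing constants in `PRIMES`, the table sizes, the resolutions, and the function names `hash4d` and `encode` are all assumptions chosen for illustration.

```python
import numpy as np

# Large constants used to mix the four integer coordinates (illustrative choice).
PRIMES = np.array([1, 2654435761, 805459861, 3674653429], dtype=np.uint64)


def hash4d(idx, table_size):
    """Map integer (x, y, z, t) grid indices of shape (N, 4) to table slots."""
    idx = idx.astype(np.uint64)
    h = np.zeros(idx.shape[0], dtype=np.uint64)
    for k in range(4):
        h ^= idx[:, k] * PRIMES[k]  # uint64 array multiplication wraps around
    return (h % np.uint64(table_size)).astype(np.int64)


def encode(xyzt, tables, resolutions):
    """Concatenate one feature vector per resolution level; xyzt is (N, 4) in [0, 1)."""
    feats = []
    for table, res in zip(tables, resolutions):
        cell = np.floor(xyzt * res).astype(np.int64)  # nearest cell, no interpolation
        feats.append(table[hash4d(cell, table.shape[0])])
    return np.concatenate(feats, axis=1)


rng = np.random.default_rng(0)
resolutions = [4, 8, 16]  # coarse to fine grids over space and time
tables = [rng.normal(scale=1e-2, size=(2**10, 2)) for _ in resolutions]
xyzt = rng.random((5, 4))  # 5 dynamic Gaussians at normalized (x, y, z, t)
features = encode(xyzt, tables, resolutions)
```

Because each level stores a fixed-size table regardless of grid resolution, the memory cost is bounded by the table sizes rather than by the number of grid cells, which is the usual reason hash encodings are described as compact.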

Paper Structure

This paper contains 16 sections, 10 equations, 15 figures, 3 tables.

Figures (15)

  • Figure 1: Our method demonstrates high-quality rendering, rapid convergence, and compact storage characteristics. It can achieve competitive results with just 5 minutes of training. Additionally, with increased training iterations, our method excels in handling finer details.
  • Figure 2: Illustration of different dynamic scene rendering methods. (a) pumarola2021d and park2021nerfies propose mapping deformation field points to canonical space, a widely adopted practice in NeRF-based methods; (b) wu20244d and yang2024deformable propose mapping canonical space points to the deformation field; (c) we propose dividing the points in canonical space into dynamic and static, and then mapping only the dynamic points to the deformation space.
  • Figure 3: Pipeline of our Swift4D. First, we use the first frame images to obtain a well-initialized canonical point cloud. Then, we train the dynamic parameter $d$ according to the method described in Sec. \ref{sec:segmentation}. Based on $d$, the point cloud is divided into dynamic and static categories. Dynamic points undergo deformation using a spatio-temporal structure, as discussed in Sec. \ref{sec:4dhash}. Finally, the deformed dynamic points are mixed with static points for rendering.
  • Figure 4:
  • Figure 5:
  • ...and 10 more figures
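The pipeline described in the Figure 3 caption (split the canonical points by the dynamic parameter $d$, deform only the dynamic ones, then mix) can be sketched as follows. This is a simplified illustration under stated assumptions: $d$ is treated as a per-point logit passed through a sigmoid and thresholded at 0.5, and `toy_deform` is a hypothetical stand-in for the learned spatio-temporal deformation. Neither detail is given in the caption.

```python
import numpy as np


def split_and_deform(xyz, d_logit, t, deform_fn, threshold=0.5):
    """Deform only points flagged as dynamic, then merge them with the static ones."""
    # Assumed classification rule: sigmoid of the learnable parameter vs. a threshold.
    is_dynamic = 1.0 / (1.0 + np.exp(-d_logit)) > threshold
    mixed = xyz.copy()  # static points pass through unchanged
    mixed[is_dynamic] = xyz[is_dynamic] + deform_fn(xyz[is_dynamic], t)
    return mixed, is_dynamic


def toy_deform(points, t):
    """Hypothetical deformation: shift every point along x by t."""
    offsets = np.zeros_like(points)
    offsets[:, 0] = t
    return offsets


xyz = np.zeros((4, 3))                       # canonical positions of 4 points
d_logit = np.array([-3.0, 2.0, -1.0, 4.0])   # per-point dynamic parameter
mixed, is_dynamic = split_and_deform(xyz, d_logit, t=0.5, deform_fn=toy_deform)
```

Only the second and fourth points are moved here; the deformation function is never evaluated on the static points, which is where the compute savings of the divide-and-conquer design would come from.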