Efficient Gaussian Splatting for Monocular Dynamic Scene Rendering via Sparse Time-Variant Attribute Modeling

Hanyang Kong, Xingyi Yang, Xinchao Wang

TL;DR

Efficient Dynamic Gaussian Splatting (EDGS) tackles the heavy computational burden of rendering dynamic scenes from monocular videos by introducing a sparse, time-variant attribute modeling framework. It uses a sparse anchor grid initialized from COLMAP. Time-invariant Gaussian attributes are decoded by tiny MLPs, while a time-mask MLP filters anchors so that time-variant attributes are queried only for deformable regions in each frame. Dynamics are modeled sparsely via an RBF kernel that propagates anchor motions to per-Gaussian offsets, enabling precise yet efficient motion representation. Empirical results on NeRF-DS and HyperNeRF show that EDGS achieves higher PSNR/SSIM and significantly higher FPS with far fewer Gaussians than state-of-the-art methods, underscoring its practical value for real-time dynamic scene rendering.

Abstract

Rendering dynamic scenes from monocular videos is a crucial yet challenging task. Deformable Gaussian Splatting has recently emerged as a robust solution for representing real-world dynamic scenes. However, it often produces heavily redundant Gaussians as it attempts to fit every training view at various time steps, which slows rendering. Additionally, the attributes of Gaussians in static areas are time-invariant, so modeling every Gaussian as time-variant is unnecessary and can cause jittering in static regions. In practice, the primary bottleneck in rendering speed for dynamic scenes is the number of Gaussians. In response, we introduce Efficient Dynamic Gaussian Splatting (EDGS), which represents dynamic scenes via sparse time-variant attribute modeling. Our approach formulates dynamic scenes using a sparse anchor-grid representation, with the motion flow of dense Gaussians calculated via a classical kernel representation. Furthermore, we propose an unsupervised strategy to efficiently filter out anchors corresponding to static areas. Only anchors associated with deformable objects are input into MLPs to query time-variant attributes. Experiments on two real-world datasets demonstrate that EDGS significantly improves rendering speed while delivering superior rendering quality compared to previous state-of-the-art methods.
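
The filtering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `dynamic_prob` stands in for the output of the time-mask MLP, `deform_fn` stands in for the deformation MLP, and the `0.5` threshold is an assumption made only for illustration. The point it shows is that static anchors never reach the time-variant query.

```python
from typing import Callable, List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def query_sparse_deformation(
    anchors: Sequence[Vec3],
    dynamic_prob: Sequence[float],
    deform_fn: Callable[[Vec3, float], Vec3],
    t: float,
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> Tuple[List[Vec3], int]:
    """Return one displacement per anchor at time t, plus the number of
    anchors that were actually passed to deform_fn."""
    displacements: List[Vec3] = []
    n_queries = 0
    for anchor, prob in zip(anchors, dynamic_prob):
        if prob > threshold:
            # Anchor is treated as deformable: query the time-variant model.
            displacements.append(deform_fn(anchor, t))
            n_queries += 1
        else:
            # Anchor is treated as static: no query, zero motion.
            displacements.append((0.0, 0.0, 0.0))
    return displacements, n_queries


def toy_deform(anchor: Vec3, t: float) -> Vec3:
    """Hypothetical stand-in for a learned deformation network."""
    return (0.0, t, 0.0)


anchors = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
mask_scores = [0.1, 0.9, 0.4]  # assumed time-mask outputs
disp, n = query_sparse_deformation(anchors, mask_scores, toy_deform, t=0.5)
```

In this toy setup only the second anchor is queried, so the cost of the time-variant branch scales with the number of deformable anchors rather than with the total anchor count.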

Paper Structure

This paper contains 27 sections, 13 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: (a) Given a set of monocular multi-view images and camera poses, our method achieves real-time rendering for dynamic scenes while maintaining high rendering quality. (b) Our method achieves promising rendering quality with faster rendering speed and fewer Gaussians. The radius of each circle indicates the number of time-variant Gaussians whose attributes need to be queried by MLPs. (c) The bottleneck of the rendering speed for dynamic scenes is the number of Gaussians: the fewer the Gaussians, the faster the rendering.
  • Figure 2: The pipeline of our EDGS. 1) We first initialize voxelized sparse anchor points from Structure from Motion (SfM) points derived from COLMAP. 2) A time-mask MLP is applied to classify whether each anchor belongs to the static area or the deformable area. 3) $k$ Gaussian offsets are initialized for each anchor $\boldsymbol{a}$ belonging to the static area. The time-invariant attributes of each Gaussian, i.e., opacity, quaternion, scale, and color, are calculated from its feature by corresponding tiny MLPs. 4) Time-variant attributes for anchors from dynamic areas are decoded by a deformable attribute MLP. An RBF kernel function is employed to compute the location of each Gaussian at timestep $t$ by calculating the similarity between each Gaussian and the anchor point it belongs to. This pipeline is compact and efficient, featuring only a few tiny MLPs for the attributes of the Gaussians and a single network for deformations. Notably, the position of each anchor remains static and is not subject to updates.
  • Figure 3: Qualitative comparison on the NeRF-DS dataset [yan2023nerfds]. Compared with other SOTA methods, our method reconstructs finer details and produces a structured rendering of the moving objects, e.g., the cup in the person's hand.
  • Figure 4: Qualitative comparison on the HyperNeRF dataset [park2021hypernerf]. Our EDGS reconstructs detailed texture and reliable structure compared with other SOTA methods.
  • Figure 5: Visualization of the difference map (diff.) and the optical flow with fixed camera views. We synthesize a fixed novel view across time for Deformable GS [yang2023deformable], 4DGS [wu20234dgs], and our method. The first row shows the rendered frames at various time steps. The second and third rows show the difference map between the $t$-th frame and the first frame, and the optical flow, respectively. The response in the highlighted red area indicates that the static area rendered by Deformable GS and 4DGS is jittering. Our method achieves better quality for both static and dynamic objects.
  • ...and 1 more figures
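
Steps 1 and 4 of the Figure 2 pipeline can be sketched roughly as follows. This is a hedged illustration under stated assumptions, not the formulation from the paper: the voxel rule (one anchor at the centre of each occupied cell), the Gaussian form of the RBF weight, and the names `voxelize`, `rbf_weight`, `gaussian_position_at_t`, and `sigma` are all hypothetical. What it shows is one plausible reading of the caption, in which a Gaussian follows its anchor's motion in proportion to its kernel similarity to that anchor, while the anchor position itself is never updated.

```python
import math
from typing import Iterable, List, Tuple

Vec3 = Tuple[float, float, float]


def voxelize(points: Iterable[Vec3], voxel_size: float) -> List[Vec3]:
    """Place one anchor at the centre of every voxel that contains at least
    one input point (assumed rule, for illustration only)."""
    cells = {tuple(math.floor(c / voxel_size) for c in p) for p in points}
    return sorted(tuple((i + 0.5) * voxel_size for i in cell) for cell in cells)


def rbf_weight(gaussian_pos: Vec3, anchor_pos: Vec3, sigma: float) -> float:
    """Gaussian RBF similarity in (0, 1]; equals 1 when the points coincide."""
    d2 = sum((g - a) ** 2 for g, a in zip(gaussian_pos, anchor_pos))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def gaussian_position_at_t(
    anchor_pos: Vec3, offset: Vec3, anchor_motion: Vec3, sigma: float
) -> Vec3:
    """Rest position (anchor + offset) moved by the anchor's displacement,
    scaled by the RBF weight. anchor_pos is read but never modified."""
    rest = tuple(a + o for a, o in zip(anchor_pos, offset))
    w = rbf_weight(rest, anchor_pos, sigma)
    return tuple(r + w * m for r, m in zip(rest, anchor_motion))


# Toy SfM points: two fall in the same voxel, one in a neighbouring voxel.
sfm_points = [(0.1, 0.1, 0.1), (0.2, 0.3, 0.4), (1.5, 0.1, 0.1)]
anchors = voxelize(sfm_points, voxel_size=1.0)

# A Gaussian sitting exactly on its anchor follows the anchor motion fully.
near = gaussian_position_at_t((0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), sigma=0.1)
# A Gaussian far from its anchor (relative to sigma) barely moves.
far = gaussian_position_at_t((0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (1.0, 0.0, 0.0), sigma=0.1)
```

The bandwidth `sigma` controls how quickly the influence of an anchor's motion decays with distance; whether it is fixed or learned in EDGS is not stated in the text above.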