4DTAM: Non-Rigid Tracking and Mapping via Dynamic Surface Gaussians

Hidenobu Matsuki, Gwangbin Bae, Andrew J. Davison

TL;DR

4DTAM introduces the first end-to-end 4D tracking and mapping method that jointly estimates camera motion, geometry, appearance, and non-rigid scene dynamics from a single RGB-D stream using differentiable rendering. The approach relies on 2D Gaussian Splatting as an explicit surface representation and an MLP-based warp-field to model time-varying deformations, combined with analytic camera pose Jacobians to enable real-time optimization. A comprehensive Sim4D synthetic dataset, ground-truth benchmarks, and open-source rendering tools are provided to support evaluation of 4D dynamic reconstruction. Empirical results show state-of-the-art performance in both tracking and 4D surface reconstruction, with robust handling of articulated and non-rigid objects, and a favorable trade-off between reconstruction quality and rendering speed. The work enables practical, dynamic-scene understanding for robotics and AR in scenarios with moving objects and non-rigid motions.

Abstract

We propose the first 4D tracking and mapping method that jointly performs camera localization and non-rigid surface reconstruction via differentiable rendering. Our approach captures 4D scenes from an online stream of color images with depth measurements or predictions by jointly optimizing scene geometry, appearance, dynamics, and camera ego-motion. Although natural environments exhibit complex non-rigid motions, 4D-SLAM remains relatively underexplored due to its inherent challenges; even with 2.5D signals, the problem is ill-posed because of the high dimensionality of the optimization space. To overcome these challenges, we first introduce a SLAM method based on Gaussian surface primitives that leverages depth signals more effectively than 3D Gaussians, thereby achieving accurate surface reconstruction. To further model non-rigid deformations, we employ a warp-field represented by a multi-layer perceptron (MLP) and introduce a novel camera pose estimation technique along with surface regularization terms that facilitate spatio-temporal reconstruction. In addition to these algorithmic challenges, a significant hurdle in 4D-SLAM research is the lack of reliable ground truth and evaluation protocols, primarily due to the difficulty of 4D capture using commodity sensors. To address this, we present a novel open synthetic dataset of everyday objects with diverse motions, leveraging large-scale object models and animation modeling. In summary, we open up modern 4D-SLAM research by introducing a novel method and evaluation protocols grounded in modern vision and rendering techniques.
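
To make the idea of an MLP warp-field concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the field takes a canonical Gaussian center and a timestamp and returns a position offset. The class name `WarpFieldMLP`, the sinusoidal input encoding, the layer sizes, and the zero-initialized output layer are all illustrative assumptions that do not come from the paper, and training is omitted.

```python
import numpy as np


def positional_encoding(x, num_freqs):
    """Expand each input coordinate into [x, sin(2^k x), cos(2^k x)] features."""
    feats = [x]
    for k in range(num_freqs):
        feats.append(np.sin((2.0 ** k) * x))
        feats.append(np.cos((2.0 ** k) * x))
    return np.concatenate(feats, axis=-1)


class WarpFieldMLP:
    """Hypothetical warp field: (canonical center, time) -> position offset."""

    def __init__(self, num_freqs=4, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = 4 * (1 + 2 * num_freqs)  # (x, y, z, t) after encoding
        self.num_freqs = num_freqs
        self.w1 = rng.normal(0.0, 0.1, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.1, (hidden, hidden))
        self.b2 = np.zeros(hidden)
        # Assumption: a zero-initialized output layer makes the untrained
        # field the identity warp, so the scene starts out rigid.
        self.w3 = np.zeros((hidden, 3))
        self.b3 = np.zeros(3)

    def offsets(self, centers, t):
        """centers: (N, 3) canonical positions, t: scalar time. Returns (N, 3)."""
        time_col = np.full((centers.shape[0], 1), float(t))
        inputs = np.concatenate([centers, time_col], axis=-1)
        x = positional_encoding(inputs, self.num_freqs)
        h = np.maximum(x @ self.w1 + self.b1, 0.0)  # ReLU
        h = np.maximum(h @ self.w2 + self.b2, 0.0)  # ReLU
        return h @ self.w3 + self.b3

    def warp(self, centers, t):
        """Move canonical centers to their positions at time t."""
        return centers + self.offsets(centers, t)


centers = np.random.default_rng(1).uniform(-1.0, 1.0, (5, 3))
field = WarpFieldMLP()
warped = field.warp(centers, t=0.5)
print(warped.shape)
```

In a full system, the weights would be optimized together with the Gaussian parameters and camera poses through the differentiable rendering loss. This sketch only shows the forward mapping from canonical space to a given time.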

Paper Structure

This paper contains 46 sections, 25 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: 4DTAM jointly estimates camera ego-motion, appearance, geometry, and scene dynamics without any template.
  • Figure 2: Method overview of 4DTAM.
  • Figure 3: Surface normal rendering of 2D Gaussians under different initializations. Left: Random initialization. Right: Our initialization aligned with sensor measurements.
  • Figure 4: Sim4D dataset. We create a new dataset for 4D reconstruction by rendering animated 3D meshes.
  • Figure 5: Qualitative comparison to SurfelWarp. Left: Rendered image. Middle: Rendered normal map. Right: Estimated camera trajectory.
  • ...and 4 more figures