TivNe-SLAM: Dynamic Mapping and Tracking via Time-Varying Neural Radiance Fields

Chengyao Duan, Zhiliu Yang

TL;DR

TivNe-SLAM tackles dynamic SLAM by introducing a time-varying implicit representation that augments 3D space with time to form 4D space–time coordinates and uses a deformation field to map to a canonical field at $t=0$. Colors and SDF are regressed by a pair of MLPs conditioned on time, with a TriLerp-based embedding and a two-stage optimization to jointly track poses and map dynamic objects. A novel overlap-based keyframe selection maximizes view coverage, enabling more complete dynamic reconstructions while maintaining real-time performance and avoiding pre-trained models. Evaluations on synthetic Room4 and ToyCar3 and real Teddy datasets show competitive tracking accuracy and superior dynamic-object reconstruction, with substantially faster training than RoDynRF. These results highlight the practical impact of 4D neural implicit representations for robust, real-time dynamic-SLAM in real-world environments.

Abstract

Previous attempts to integrate Neural Radiance Fields (NeRF) into the Simultaneous Localization and Mapping (SLAM) framework either assume static scenes or require ground-truth camera poses, which impedes their application in real-world scenarios. This paper proposes a time-varying representation to track and reconstruct dynamic scenes. Firstly, two processes, a tracking process and a mapping process, are maintained simultaneously in our framework. In the tracking process, all input images are uniformly sampled and then progressively trained in a self-supervised paradigm. In the mapping process, we leverage motion masks to distinguish dynamic objects from the static background, and sample more pixels from dynamic areas. Secondly, the parameter optimization for both processes comprises two stages. The first stage associates time with 3D positions to convert the deformation field to the canonical field. The second stage associates time with the embeddings of the canonical field to obtain colors and a Signed Distance Function (SDF). Lastly, we propose a novel keyframe selection strategy based on the overlapping rate. Our approach is evaluated on two synthetic datasets and one real-world dataset, and the experiments validate that our method achieves competitive results in both tracking and mapping when compared to existing state-of-the-art NeRF-based dynamic SLAM systems.
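The abstract names an overlap-based keyframe selection strategy but does not define the overlapping rate. The following is only a minimal sketch of one plausible reading, not the authors' procedure. It assumes each frame is summarized by the set of scene cells (for example, voxel indices) it observes, defines the overlap rate as the fraction of the current frame's cells that a candidate also sees, and greedily picks keyframes that cover the most not-yet-covered cells. The names `overlap_rate`, `select_keyframes`, and the set-of-cells representation are all hypothetical.

```python
def overlap_rate(current_cells, candidate_cells):
    """Fraction of the current frame's observed cells also seen by a candidate.

    Assumption: frames are summarized as sets of hashable cell ids.
    """
    if not current_cells:
        return 0.0
    return len(current_cells & candidate_cells) / len(current_cells)


def select_keyframes(current_cells, database, k):
    """Greedily pick up to k keyframe ids that jointly cover the current view.

    database maps keyframe id -> set of observed cells (hypothetical layout).
    Ties are broken by insertion order of the database.
    """
    remaining = set(current_cells)      # cells of the current view not yet covered
    candidates = dict(database)         # shallow copy so the caller's dict is untouched
    chosen = []
    while remaining and candidates and len(chosen) < k:
        best_id = max(candidates, key=lambda fid: len(remaining & candidates[fid]))
        if not remaining & candidates[best_id]:
            break                       # no candidate adds coverage any more
        chosen.append(best_id)
        remaining -= candidates.pop(best_id)
    return chosen


# Toy usage: four stored keyframes, current frame observes cells 1..6.
current = {1, 2, 3, 4, 5, 6}
database = {0: {1, 2, 3}, 1: {3, 4}, 2: {4, 5, 6, 7}, 3: {9}}
selected = select_keyframes(current, database, k=3)
```

In this toy case the selection stops after two keyframes because they already cover the whole current view; a greedy coverage rule like this tends to prefer complementary views over redundant ones, which matches the stated goal of maximizing view coverage.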

Paper Structure (22 sections, 10 equations, 8 figures, 4 tables)

This paper contains 22 sections, 10 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Illustration of the reconstruction of a dynamic object by our TivNe-SLAM. We introduce a dynamic SLAM system capable of camera tracking and reconstruction of moving objects. Here, we show our 3D reconstruction result, at different time stamps, of a scene from the Room4 dataset [runz2017co], in which a flying ship moves from the right to the left side of the room. The state of the flying ship is successfully captured and precisely reconstructed.
  • Figure 2: Overview of our TivNe-SLAM framework. Our system simultaneously maintains two processes, a tracking process and a mapping process. Tracking Process: It first initializes a map using the first frame and initializes the camera pose of each frame. Valid points of the current frame are sampled, encoded by an embedding collection, and tri-linearly interpolated. The interpolated results are fed into two MLPs, which predict colors and SDF values used to render RGB and depth images. The tracking loss is constructed accordingly. During tracking, the mapping parameters are frozen and only the pose of the current frame is optimized. Mapping Process: After obtaining the camera pose of the current frame from the tracking process, we apply a strategy to select target keyframes from an incrementally growing database for reconstruction. The system then leverages Mask R-CNN [he2017mask] to obtain segmentation masks of dynamic objects. As in the tracking process, point sampling, embedding, interpolation, and MLP regression are executed to obtain colors and SDF values, which are used to reconstruct the meshes. The poses, embedding parameters, and MLPs are optimized in the mapping process.
  • Figure 3: Architecture of our neural deformation field and canonical field. We maintain a sparse embedding collection $\mathbf{E} \in \mathbb{R}^{H \times Q}$, which is a collection of $Q$-dimensional vectors. We then encode positions $\mathbf{x}^i_j \in \mathbb{R}^3$ as $\mathbf{e}^i_j(t) \in \mathbb{R}^Q$. $\mathbf{e}^i_j(t)$ is associated with time $t$ to regress offsets $\Delta \mathbf{e}^i_j(t)$ by $\Theta_{d}$. Colors and SDF values are obtained by feeding $(\mathbf{e}^i_j(t) + \Delta \mathbf{e}^i_j(t), t)$ to $\Theta_{s}$.
  • Figure 4: Comparison of mesh quality between two NeRF-based SLAM systems and two keyframe-selection variants of our TivNe-SLAM. Mapping results of the different datasets are shown row by row. (1) Room4-1: We completely reconstruct the flying ship, whereas NICE-SLAM [zhu2022nice] and Vox-Fusion [yang2022vox] are unable to capture it. Randomly selecting keyframes generates unstable shapes and introduces gaps into the reconstructed objects. Overlap-based TivNe-SLAM handles this problem well. (2) Room4-2: NICE-SLAM cannot reconstruct the blue car at all, and Vox-Fusion reconstructs the dynamic car only occasionally and fails to eliminate residual reconstructions at previous positions. Again, our method with the overlap-based strategy generates the best results. (3) ToyCar3: Only our method reconstructs the white car on the right side of the scene, whereas the others treat it as an outlier. Additionally, the reflected light on the plane's wings gradually shifts to the left wing in the input images, and our method still effectively captures this transition. (4) Teddy: Neither NICE-SLAM nor Vox-Fusion can reconstruct the scene of this real-world dynamic dataset. Using randomly selected keyframes yields poor results. However, our method based on overlapping selection fully reconstructs the dynamic teddy, the arms of a person, and the static background. (The brightness of images of the Teddy dataset is slightly adjusted for clearer visualization.)
  • Figure 5: Comparison of rendered images from dynamic NeRF methods. Our TivNe-SLAM renders images that closely resemble the input images on all three scenes. However, RoDynRF fails to completely render dynamic objects in the first two scenes, where the view moves quickly. For the third scene, in which the camera moves more slowly, RoDynRF renders better results for both dynamic objects and the static background when given additional training time.
  • ...and 3 more figures
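
The data flow described in the Figure 2 and Figure 3 captions (tri-linear interpolation of embeddings, a time-conditioned offset from $\Theta_{d}$, then colors and SDF from $\Theta_{s}$) can be illustrated with a small sketch. This is an assumption-laden toy in plain Python, not the authors' implementation: the function names `trilerp` and `query_field` are hypothetical, the embeddings live on the eight corners of a single unit voxel, and `theta_d` and `theta_s` are passed in as arbitrary callables standing in for the two MLPs.

```python
from itertools import product


def trilerp(corner_embeddings, u, v, w):
    """Tri-linearly interpolate Q-dimensional embeddings inside one voxel.

    corner_embeddings maps (i, j, k) in {0, 1}^3 to a list of Q floats;
    (u, v, w) are local coordinates in [0, 1] within that voxel.
    """
    q = len(corner_embeddings[(0, 0, 0)])
    out = [0.0] * q
    for i, j, k in product((0, 1), repeat=3):
        # Weight of a corner is the product of per-axis linear weights.
        weight = (u if i else 1.0 - u) * (v if j else 1.0 - v) * (w if k else 1.0 - w)
        for d in range(q):
            out[d] += weight * corner_embeddings[(i, j, k)][d]
    return out


def query_field(corner_embeddings, local_xyz, t, theta_d, theta_s):
    """Embedding -> time-conditioned offset -> (color, sdf), as in the Figure 3 caption.

    theta_d(e, t) returns an offset with the same length as e (stand-in for the
    deformation MLP); theta_s(e, t) returns whatever the canonical MLP would.
    """
    e = trilerp(corner_embeddings, *local_xyz)
    delta = theta_d(e, t)
    shifted = [a + b for a, b in zip(e, delta)]
    return theta_s(shifted, t)


# Toy usage: each corner's embedding equals its own coordinates (Q = 3),
# so interpolation should reproduce the query point exactly.
corners = {c: [float(x) for x in c] for c in product((0, 1), repeat=3)}
point = (0.25, 0.5, 0.75)
toy_theta_d = lambda e, t: [t] * len(e)      # stand-in: constant shift growing with time
toy_theta_s = lambda e, t: (e, sum(e))       # stand-in: returns (features, pseudo-SDF)

interp = trilerp(corners, *point)
features_t0, sdf_t0 = query_field(corners, point, 0.0, toy_theta_d, toy_theta_s)
features_t1, sdf_t1 = query_field(corners, point, 1.0, toy_theta_d, toy_theta_s)
```

With the toy deformation, the offset vanishes at $t=0$, so the query reduces to the undeformed embedding; this mirrors the idea of a canonical field at $t=0$, although a learned $\Theta_{d}$ is not guaranteed to behave this way unless trained or constrained to do so.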