TimeFormer: Capturing Temporal Relationships of Deformable 3D Gaussians for Robust Reconstruction

DaDong Jiang, Zhihui Ke, Xiaobo Zhou, Zhi Hou, Xianghui Yang, Wenbo Hu, Tie Qiu, Chunchao Guo

TL;DR

TimeFormer tackles robust dynamic scene reconstruction with deformable 3D Gaussians by introducing a Cross-Temporal Transformer Encoder that implicitly learns cross-timestamp motion patterns. A two-stream optimization transfers this learned motion to the base deformation field, allowing TimeFormer to be omitted during inference to preserve rendering speed. Across multi-view and monocular datasets, TimeFormer yields consistent qualitative and quantitative gains, improves FPS, and enables more efficient canonical-space distributions, particularly in scenes with violent motion or reflective surfaces. The work demonstrates that global temporal relationships learned via attention can outperform local, per-timestamp motion modeling, offering a practical, plug-and-play enhancement for state-of-the-art deformable 3D Gaussian methods.

Abstract

Dynamic scene reconstruction is a long-standing challenge in 3D vision. Recent methods extend 3D Gaussian Splatting to dynamic scenes via additional deformation fields and apply explicit constraints, such as motion flow, to guide the deformation. However, they learn motion changes from individual timestamps independently, which makes it difficult to reconstruct complex scenes, particularly those with violent movement, extreme-shaped geometries, or reflective surfaces. To address this issue, we design a plug-and-play module called TimeFormer that equips existing deformable 3D Gaussian reconstruction methods with the ability to implicitly model motion patterns from a learning perspective. Specifically, TimeFormer includes a Cross-Temporal Transformer Encoder, which adaptively learns the temporal relationships of deformable 3D Gaussians. Furthermore, we propose a two-stream optimization strategy that transfers the motion knowledge learned by TimeFormer to the base stream during the training phase. This allows us to remove TimeFormer during inference, thereby preserving the original rendering speed. Extensive experiments on multi-view and monocular dynamic scenes validate the qualitative and quantitative improvements brought by TimeFormer. Project Page: https://patrickddj.github.io/TimeFormer/
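
The two-stream strategy described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the toy `deformation_field`, the `aux_offsets` callable standing in for TimeFormer, and the use of target positions in place of a photometric rendering loss are all assumptions made for illustration. The point it shows is that both streams evaluate the same deformation parameters `theta`, so whatever the auxiliary stream learns is stored in weights the base stream also uses, and inference needs only the base path.

```python
import numpy as np

def deformation_field(theta, positions, t):
    """Toy shared deformation field (hypothetical): (N, 3) positions and a time t -> deformed (N, 3)."""
    time_col = np.full((positions.shape[0], 1), t)
    feats = np.concatenate([positions, time_col], axis=1)       # (N, 4)
    return positions + np.tanh(feats @ theta["W1"]) @ theta["W2"]

def two_stream_losses(theta, mu, timestamps, targets, aux_offsets):
    """Return (base_loss, aux_loss) for one batch of B sampled timestamps.

    aux_offsets(mu, timestamps) stands in for TimeFormer and must return
    per-timestamp position offsets of shape (B, N, 3).
    targets has shape (B, N, 3) and replaces the rendering loss (assumption).
    """
    offsets = aux_offsets(mu, timestamps)
    base_loss, aux_loss = 0.0, 0.0
    for b, t in enumerate(timestamps):
        # Base stream: canonical positions go straight into the shared field.
        base_pred = deformation_field(theta, mu, t)
        # Auxiliary stream: positions are shifted by the learned offsets first,
        # then passed through the SAME field (same theta).
        aux_pred = deformation_field(theta, mu + offsets[b], t)
        base_loss += np.mean((base_pred - targets[b]) ** 2)
        aux_loss += np.mean((aux_pred - targets[b]) ** 2)
    n = len(timestamps)
    return base_loss / n, aux_loss / n

def infer(theta, mu, t):
    """Inference uses only the base stream; the auxiliary module is dropped."""
    return deformation_field(theta, mu, t)
```

In training, the two returned terms would be combined into one objective (for example by summing them, which is an assumption here since this section does not state the weighting) and differentiated with respect to `theta` and the auxiliary module's parameters using an autodiff framework.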

Paper Structure

This paper contains 18 sections, 12 equations, 12 figures, 6 tables, 2 algorithms.

Figures (12)

  • Figure 1: TimeFormer guides optimization toward a more efficiently distributed canonical space, yielding higher FPS and better quality. The results are from the "cut lemon" scene in the HyperNeRF dataset (Park et al., 2021).
  • Figure 2: The Framework of Deformable 3D Gaussians Reconstruction with TimeFormer. Existing deformable 3D Gaussian frameworks usually include a canonical space and a deformation field (first row). We incorporate TimeFormer to capture cross-time relationships and explore motion patterns implicitly (second row). We share the weights of the two deformation fields to transfer the learned motion knowledge, which allows us to exclude this Auxiliary Training Module during inference.
  • Figure 3: The Structure of the Cross-Temporal Encoder. We concatenate randomly sampled timestamps with the position $\mu$, treating them as special tokens in a sequence. This module is designed to model multi-temporal relationships and produce distinct time-variant position offsets $\Delta p_0, \dots, \Delta p_{B-1}$.
  • Figure 4: Data Flow Changes in the Deformation Field. Dashed lines represent the new data flow among time samples $t_0, \dots, t_{B-1}$.
  • Figure 5: Visualization of Comparisons on the N3DV Dataset (Li et al., 2022).
  • ...and 7 more figures
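
The token layout described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the paper's architecture: it assumes a single attention head, one layer, no positional or frequency encoding, and hypothetical parameter names (`W_in`, `W_q`, `W_k`, `W_v`, `W_out`) with an arbitrary hidden width. What it shows is the core idea that each Gaussian contributes a sequence of B tokens, one per sampled timestamp, and attention runs across that time axis to produce the offsets $\Delta p_0, \dots, \Delta p_{B-1}$.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(rng, dim=16):
    """Random weights for the sketch; names and sizes are hypothetical."""
    s = 0.1
    return {
        "W_in": rng.normal(scale=s, size=(4, dim)),
        "W_q": rng.normal(scale=s, size=(dim, dim)),
        "W_k": rng.normal(scale=s, size=(dim, dim)),
        "W_v": rng.normal(scale=s, size=(dim, dim)),
        "W_out": rng.normal(scale=s, size=(dim, 3)),
    }

def cross_temporal_offsets(mu, timestamps, params):
    """mu: (N, 3) canonical positions; timestamps: (B,) sampled times.

    Returns offsets of shape (B, N, 3) and attention weights of shape (N, B, B).
    """
    n, b = mu.shape[0], timestamps.shape[0]
    # One token per (Gaussian, timestamp): [x, y, z, t] -> (N, B, 4)
    pos = np.broadcast_to(mu[:, None, :], (n, b, 3))
    time = np.broadcast_to(timestamps[None, :, None], (n, b, 1))
    tokens = np.concatenate([pos, time], axis=-1)
    h = tokens @ params["W_in"]                                   # (N, B, D)
    q, k, v = h @ params["W_q"], h @ params["W_k"], h @ params["W_v"]
    # Self-attention over the B timestamps, independently for each Gaussian.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])      # (N, B, B)
    attn = softmax(scores, axis=-1)
    h = h + attn @ v                                              # residual connection
    delta = h @ params["W_out"]                                   # (N, B, 3)
    return delta.transpose(1, 0, 2), attn

rng = np.random.default_rng(0)
params = init_params(rng, dim=8)
mu = rng.normal(size=(5, 3))                  # 5 Gaussians
ts = np.array([0.1, 0.4, 0.7, 0.9])           # B = 4 sampled timestamps
delta, attn = cross_temporal_offsets(mu, ts, params)
print(delta.shape)  # (4, 5, 3)
```

Because the attention matrix is B by B per Gaussian, the offset for one timestamp depends on all other timestamps sampled in the same batch, which is the cross-time data flow that the dashed lines in Figure 4 refer to.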