4DGT: Learning a 4D Gaussian Transformer Using Real-World Monocular Videos

Zhen Xu, Zhengqin Li, Zhao Dong, Xiaowei Zhou, Richard Newcombe, Zhaoyang Lv

TL;DR

4DGT introduces a 4D Gaussian Transformer that learns dynamic scene reconstruction from real-world monocular videos in a purely feed-forward manner. By unifying static and dynamic content through 4D Gaussian Splatting and a lifespan-aware representation, it handles long-range video sequences via rolling windows. The method employs density control (pruning and densification) and multi-level spatiotemporal attention to manage computational costs, enabling real-time rendering and scalable training on real data. Empirically, 4DGT achieves competitive or superior quality to optimization-based baselines while offering orders-of-magnitude faster inference and better cross-domain generalization when trained on diverse monocular datasets.

Abstract

We propose 4DGT, a 4D Gaussian-based Transformer model for dynamic scene reconstruction, trained entirely on real-world monocular posed videos. Using 4D Gaussians as an inductive bias, 4DGT unifies static and dynamic components, enabling the modeling of complex, time-varying environments with varying object lifespans. We propose a novel density control strategy in training, which enables 4DGT to handle longer space-time input while keeping rendering efficient at runtime. Our model processes 64 consecutive posed frames in a rolling-window fashion, predicting consistent 4D Gaussians in the scene. Unlike optimization-based methods, 4DGT performs purely feed-forward inference, reducing reconstruction time from hours to seconds and scaling effectively to long video sequences. Trained only on large-scale monocular posed video datasets, 4DGT significantly outperforms prior Gaussian-based networks on real-world videos and achieves accuracy on par with optimization-based methods on cross-domain videos. Project page: https://4dgt.github.io
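
The rolling-window processing mentioned in the abstract can be illustrated with a minimal sketch. Only the window length of 64 frames comes from the abstract. The function name `rolling_windows`, the `stride` parameter, and the choice to let the last window overlap its predecessor so that the tail of the video is covered are assumptions made for illustration, not the authors' implementation.

```python
def rolling_windows(num_frames, window=64, stride=64):
    """Return (start, end) frame-index pairs, end exclusive.

    Hypothetical helper: the abstract states only that 64 consecutive
    posed frames are processed per window. The stride and the handling
    of the final partial window are assumptions.
    """
    if num_frames <= 0:
        return []
    if num_frames <= window:
        # Short clip: a single (possibly shorter) window covers everything.
        return [(0, num_frames)]
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] + window < num_frames:
        # Shift a final full-length window back so the tail is covered.
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]


if __name__ == "__main__":
    for start, end in rolling_windows(150):
        print(f"reconstruct frames [{start}, {end})")
```

Each window would be passed to the feed-forward model independently. How the predictions of overlapping windows are merged is not specified in this section and is left out of the sketch.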

Paper Structure

This paper contains 24 sections, 15 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: We propose a scalable 4D dynamic reconstruction model trained only on real-world monocular RGB videos. The feed-forward 4DGS (Sec. \ref{sec:4dgs}) representation enables us to render the geometry and appearance of the dynamic scene from novel views in real time. Even without explicit supervision, the model can learn to distinguish dynamic content from the background and produce realistic optical flows. The figure shows an enlarged set of Gaussians for the purpose of visualization. The embedded rendered videos only play in Adobe Reader or KDE Okular.
  • Figure 2: An overview of our method in training and rendering. 4DGT takes a series of monocular frames with poses as input. During training, we subsample the temporal frames at different granularities and use all images in training. In stage one, we train 4DGT to predict pixel-aligned Gaussians at coarse resolution. In stage two, we prune the majority of non-activated Gaussians according to the histograms of per-patch activation channels, and densify the Gaussian prediction by increasing the number of input token samples in both space and time. At inference time, we run the 4DGT network obtained after stage two, which supports dense video frame input at high resolution.
  • Figure 3: From left to right, we show novel space-time view comparisons on ADT [pan2023aria], EgoExo4D [grauman2024ego], DyCheck [gao2022monocular], and the DyCheck test view (rightmost). We render the depth (upper right) and normal (lower right) next to each synthesized novel view. For the ground-truth depth and normal on EgoExo4D and DyCheck, we use predictions of expert models on the ground-truth images as a reference. Please refer to the appendix for more visual comparisons.
  • Figure 4: The predicted opacity map ($\in \mathbb{R}^{N \times H \times W}$) of the pixel-aligned dynamic Gaussians from 4DGT and the computed histogram ($\in \mathbb{R}^{p \times p}$) of the activation distribution. The right section shows the difference between histogram thresholding (Ours) and other filtering methods (randomly or uniformly selecting the Gaussians to keep) for reducing the number of Gaussians.
  • Figure 5: Ablation study on proposed components.
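
The histogram-based pruning described in the captions of Figures 2 and 4 can be sketched as follows. This is one possible reading of the captions, not the authors' code. The sketch assumes that the $N \times H \times W$ opacity map is split into $p \times p$ patches, that a Gaussian counts as "activated" when its opacity exceeds a threshold, and that the $p \times p$ histogram counts activations per within-patch position across all frames and patches. The names `patch_activation_histogram`, `histogram_keep_mask`, and `prune_by_histogram`, and the parameters `act_thresh` and `keep_ratio`, are hypothetical.

```python
import numpy as np


def patch_activation_histogram(opacity, p, act_thresh=0.5):
    """Count activations per within-patch position; returns a (p, p) array.

    opacity: (N, H, W) array of per-pixel Gaussian opacities.
    act_thresh is an assumed activation threshold.
    """
    n, h, w = opacity.shape
    assert h % p == 0 and w % p == 0, "H and W must be divisible by p"
    active = opacity > act_thresh
    # Axes after reshape: (frame, patch row, row in patch, patch col, col in patch).
    blocks = active.reshape(n, h // p, p, w // p, p)
    return blocks.sum(axis=(0, 1, 3))


def histogram_keep_mask(hist, keep_ratio):
    """Keep the most frequently activated within-patch positions."""
    k = max(1, int(round(keep_ratio * hist.size)))
    # Stable sort on negated counts: ties resolve to the lowest index.
    top = np.argsort(-hist.ravel(), kind="stable")[:k]
    mask = np.zeros(hist.size, dtype=bool)
    mask[top] = True
    return mask.reshape(hist.shape)


def prune_by_histogram(opacity, p, keep_ratio=0.25, act_thresh=0.5):
    """Return an (N, H, W) boolean keep-mask and the (p, p) histogram."""
    n, h, w = opacity.shape
    hist = patch_activation_histogram(opacity, p, act_thresh)
    patch_mask = histogram_keep_mask(hist, keep_ratio)
    # Repeat the same per-patch pattern over the whole image and all frames.
    full = np.tile(patch_mask, (h // p, w // p))
    return np.broadcast_to(full, (n, h, w)), hist


if __name__ == "__main__":
    # Toy input: only the top-left position of every 2x2 patch is active.
    opacity = np.full((2, 4, 4), 0.1)
    opacity[:, ::2, ::2] = 0.9
    keep, hist = prune_by_histogram(opacity, p=2, keep_ratio=0.25)
    print(hist)
    print(int(keep.sum()), "of", keep.size, "Gaussians kept")
```

Under these assumptions, thresholding the histogram retains the positions where activation mass actually concentrates. The random and uniform baselines in Figure 4 would instead pick positions without looking at the activations.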