MoRe: Motion-aware Feed-forward 4D Reconstruction Transformer

Juntong Fang, Zequn Chen, Weiqi Zhang, Donglin Di, Xuancheng Zhang, Chengmin Yang, Yu-Shen Liu

TL;DR

MoRe is a feedforward 4D reconstruction network that efficiently recovers dynamic 3D scenes from monocular videos and employs an attention-forcing strategy to disentangle dynamic motion from static structure, ensuring temporally coherent geometry reconstruction.

Abstract

Reconstructing dynamic 4D scenes remains challenging due to the presence of moving objects that corrupt camera pose estimation. Existing optimization methods alleviate this issue with additional supervision, but they are mostly computationally expensive and impractical in real-time applications. To address these limitations, we propose MoRe, a feedforward 4D reconstruction network that efficiently recovers dynamic 3D scenes from monocular videos. Built upon a strong static reconstruction backbone, MoRe employs an attention-forcing strategy to disentangle dynamic motion from static structure. To further enhance robustness, we fine-tune the model on large-scale, diverse datasets encompassing both dynamic and static scenes. Moreover, our grouped causal attention captures temporal dependencies and adapts to varying token lengths across frames, ensuring temporally coherent geometry reconstruction. Extensive experiments on multiple benchmarks demonstrate that MoRe achieves high-quality dynamic reconstructions with exceptional efficiency.

Paper Structure

This paper contains 32 sections, 12 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: We propose MoRe, a motion-aware 4D reconstruction transformer that explicitly disentangles dynamic motion from static scene structure. This capability is enabled by our attention-forcing training strategy, which guides the model to separate motion cues from background geometry. At inference time, MoRe further supports streaming inputs through its grouped causal attention design.
  • Figure 2: Method Overview. During training, an attention-forcing mechanism aligns the attention weights with ground-truth motion masks, enabling the model to effectively disentangle dynamic motion from static scene structure. For the streaming reconstruction task, MoRe uses a causal transformer in which global attention is replaced by aggregated causal attention.
  • Figure 3: Attention Map Visualization. We visualize the attention map of the camera token within VGGT [wang2025vggt] and observe that the model tends to confuse moving objects with static background regions, which accounts for the degradation in prediction accuracy.
  • Figure 4: Grouped Causal Attention. Unlike traditional causal attention, our design allows image tokens within the same frame to attend to each other regardless of their ordering. This formulation enables the model to preserve causal temporal reasoning while maintaining spatial consistency within each frame.
  • Figure 5: Streaming Inference Pipeline. Leveraging causal attention, our model can efficiently process streaming input in an online manner. To enhance camera pose accuracy, we apply a bundle-adjustment-like post-processing step after the entire sequence has been processed. Specifically, for each frame, we duplicate the camera token and perform inference again using the previously cached key-value pairs.
  • ...and 5 more figures
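
The grouped causal attention pattern described in the Figure 4 caption can be illustrated with a small mask-building sketch. This is only an illustration of the stated rule (tokens attend freely within their own frame and causally across frames), not the authors' implementation. The function names `grouped_causal_mask` and `plain_causal_mask` and the `tokens_per_frame` argument are hypothetical and chosen for this example.

```python
from typing import List


def grouped_causal_mask(tokens_per_frame: List[int]) -> List[List[bool]]:
    """Return mask[q][k], True when query token q may attend to key token k.

    Assumption for this sketch: tokens are laid out frame by frame, and
    tokens_per_frame[i] is the (possibly varying) token count of frame i.
    """
    # Map every token position to the index of the frame it belongs to.
    frame_of = [f for f, n in enumerate(tokens_per_frame) for _ in range(n)]
    # A key is visible when its frame is the query's frame or an earlier one,
    # so ordering inside a frame does not matter.
    return [[fk <= fq for fk in frame_of] for fq in frame_of]


def plain_causal_mask(num_tokens: int) -> List[List[bool]]:
    """Token-level causal mask, shown only for comparison."""
    return [[k <= q for k in range(num_tokens)] for q in range(num_tokens)]


# Hypothetical example: frame 0 has 2 tokens, frame 1 has 3 tokens.
mask = grouped_causal_mask([2, 3])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, each row is a query token and each `x` marks a key it may attend to. The first token of frame 0 can see the second token of the same frame, which a token-level causal mask would forbid, while no token of frame 0 can see frame 1. Because the mask is derived from per-frame token counts, frames of different lengths need no special handling.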