Easi3R: Estimating Disentangled Motion from DUSt3R Without Training

Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, Anpei Chen

TL;DR

The paper tackles dynamic 4D reconstruction with Easi3R, a training-free adaptation of DUSt3R that analyzes cross-attention maps to disentangle object motion from camera motion. It derives dynamic segmentation from attention cues, then runs a second inference pass with attention re-weighting to produce robust 4D reconstructions and pose estimates, with no fine-tuning on dynamic data. The approach achieves state-of-the-art or competitive results on dynamic segmentation (DAVIS) and camera pose benchmarks (DyCheck, ADT, TUM-dynamics), and demonstrates strong 4D reconstruction performance on DyCheck, all at minimal additional cost. The work suggests that pre-trained 3D reconstruction models inherently encode motion structure that can be exploited for dynamic tasks, which may also inform attention-based methods in other domains.

Abstract

Recent advances in DUSt3R have enabled robust estimation of dense point clouds and camera parameters of static scenes, leveraging Transformer network architectures and direct supervision on large-scale 3D datasets. In contrast, the limited scale and diversity of available 4D datasets present a major bottleneck for training a highly generalizable 4D model. This constraint has driven conventional 4D methods to fine-tune 3D models on scalable dynamic video data with additional geometric priors such as optical flow and depth. In this work, we take the opposite path and introduce Easi3R, a simple yet efficient training-free method for 4D reconstruction. Our approach applies attention adaptation during inference, eliminating the need for from-scratch pre-training or network fine-tuning. We find that the attention layers in DUSt3R inherently encode rich information about camera and object motion. By carefully disentangling these attention maps, we achieve accurate dynamic region segmentation, camera pose estimation, and 4D dense point map reconstruction. Extensive experiments on real-world dynamic videos demonstrate that our lightweight attention adaptation significantly outperforms previous state-of-the-art methods that are trained or fine-tuned on extensive dynamic datasets. Our code is publicly available for research purposes at https://easi3r.github.io/

Paper Structure

This paper contains 16 sections, 16 equations, 13 figures, 11 tables.

Figures (13)

  • Figure 1: We present Easi3R, a training-free, plug-and-play approach that efficiently disentangles object and camera motion, enabling the adaptation of DUSt3R for 4D reconstruction.
  • Figure 2: DUSt3R with Dynamic Video. We process videos using a sliding window and infer the DUSt3R network pairwise. Reconstruction degrades with misalignment when dynamic objects occupy a considerable portion of the frames.
  • Figure 3: DUSt3R and our Easi3R adaptation. DUSt3R encodes two images $I^a,I^b$ into feature tokens $\mathbf{F}_0^a,\mathbf{F}_0^b$, which are then decoded into point maps in the reference view coordinate space using two decoders. Our Easi3R aggregates the cross-attention maps from the decoders, producing four semantically meaningful maps: $\mathbf{A}^{b=\text{src}}_\mu,\mathbf{A}^{b=\text{src}}_\sigma,\mathbf{A}^{a=\text{ref}}_\mu,\mathbf{A}^{a=\text{ref}}_\sigma$. These maps are then used for a second inference pass to enhance reconstruction quality.
  • Figure 4: Visualization of Cross-Attention Maps. We color the normalized values of the attention maps, ranging from one to zero. We highlight the patterns captured by each type of attention map using relatively high values. For a more detailed demonstration, we invite reviewers to visit our webpage at https://easi3r.github.io/.
  • Figure 5: Qualitative Results of Dynamic Object Segmentation. "Ours" refers to the $\text{Easi3R}_\text{monst3r}$ setting. Here, we present the enhanced setting, where outputs from different methods serve as prompts and are used with SAM2 for mask inference.
  • ...and 8 more figures
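
The captions for Figures 2 to 4 mention three mechanical steps: pairing frames inside a sliding window, aggregating cross-attention maps into $\mu$ and $\sigma$ maps, and normalizing the result for display. The sketch below illustrates those steps in plain Python under stated assumptions; it is not the authors' implementation. It assumes that $\mu$ and $\sigma$ denote a per-token mean and standard deviation taken over a stack of attention maps, that each map is already flattened to a list of equal length, and that normalization is min-max scaling. The names `sliding_window_pairs`, `aggregate_maps`, `normalize`, and the parameter `max_gap` are hypothetical.

```python
from statistics import mean, pstdev


def sliding_window_pairs(num_frames, max_gap):
    """List frame-index pairs (i, j) with i < j and j - i <= max_gap.

    Assumption: a 'sliding window' means pairing each frame with the
    next max_gap frames; the paper may pair frames differently.
    """
    return [
        (i, j)
        for i in range(num_frames)
        for j in range(i + 1, min(i + max_gap, num_frames - 1) + 1)
    ]


def aggregate_maps(maps):
    """Per-token mean and population std over a stack of flat maps.

    Assumption: the mu/sigma subscripts in Figure 3 are read here as
    mean and standard deviation across the stacked maps.
    """
    if not maps:
        raise ValueError("need at least one attention map")
    length = len(maps[0])
    if any(len(m) != length for m in maps):
        raise ValueError("all attention maps must have the same length")
    columns = list(zip(*maps))  # one tuple of values per token
    mu = [mean(col) for col in columns]
    sigma = [pstdev(col) for col in columns]
    return mu, sigma


def normalize(values):
    """Min-max scale values to [0, 1]; a constant map becomes all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    print(sliding_window_pairs(num_frames=4, max_gap=2))
    mu, sigma = aggregate_maps([[0.0, 1.0], [0.0, 3.0]])
    print(normalize(mu), sigma)
```

Keeping the mean and the standard deviation as separate outputs mirrors the figure's distinction between the $\mu$ and $\sigma$ maps; how the paper combines them into a segmentation mask is not specified in these captions and is left out of the sketch.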