RayMap3R: Inference-Time RayMap for Dynamic 3D Reconstruction

Feiran Wang; Zezhou Shang; Gaowen Liu; Yan Yan

Abstract

Streaming feed-forward 3D reconstruction enables real-time joint estimation of scene geometry and camera poses from RGB images. However, without explicit dynamic reasoning, streaming models can be affected by moving objects, causing artifacts and drift. In this work, we propose RayMap3R, a training-free streaming framework for dynamic scene reconstruction. We observe that RayMap-based predictions exhibit a static-scene bias, providing an internal cue for dynamic identification. Based on this observation, we construct a dual-branch inference scheme that identifies dynamic regions by contrasting RayMap and image predictions, suppressing their interference during memory updates. We further introduce reset metric alignment and state-aware smoothing to preserve metric consistency and stabilize predicted trajectories. Our method achieves state-of-the-art performance among streaming approaches on dynamic scene reconstruction across multiple benchmarks.

Paper Structure (15 sections, 6 equations, 8 figures, 9 tables)

Introduction
Related Work
Methods
RayMap and Memory Mechanism
Static Bias of RayMap Predictions
Dynamic Identification via RayMap Remap
Reset Metric Alignment
State-Aware Smoothing
Experiments
Video Depth Estimation
Camera Pose Estimation
3D Reconstruction
Qualitative Results
Analysis
Conclusion

Figures (8)

Figure 1: Streaming 3D Reconstruction for Dynamic Scenes. RayMap3R leverages the static bias of RayMap-only predictions to identify and suppress dynamic regions at inference time, reducing camera drift and improving geometry fidelity.
Figure 2: RayMap Static Bias. Left: The main prediction and RayMap prediction produce differing depth estimates in dynamic regions, suggesting a static bias in RayMap predictions. Right: The depth discrepancy correlates with ground-truth dynamic ratio across multiple datasets, suggesting a consistent signal for dynamic identification.
Figure 3: Dynamic Map Visualization. The dynamic map is the per-pixel depth difference between the main branch and RayMap-only predictions, closely aligning with ground-truth dynamic masks across both synthetic and real scenes.
Figure 4: Method Overview. Our method performs streaming 3D reconstruction via dual-branch inference. At time step $t$, the main branch predicts depth and pose from state $s_{t-1}$ using both image and RayMap features. The predicted pose $\hat{\mathbf{T}}_t$ is remapped into RayMap and encoded into tokens $r'_t$, then queried against the frozen state $s_{t-1}$ by the RayMap branch to obtain a static-biased prediction. The depth difference between branches is projected via image-state attention to form staticness weights $\alpha_t$ that suppress dynamic regions during state update ($s_t = s_{t-1} + \alpha_t \odot \Delta s_t$). This dual-branch scheme identifies and suppresses dynamic regions at inference time.
Figure 5: Qualitative Results on DAVIS [perazzi2016benchmark] Videos. We compare our method with CUT3R [cut3r] and TTT3R [ttt3r]. Our method achieves more stable camera pose estimation and produces clearer reconstructions.
...and 3 more figures
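
The captions for Figures 3 and 4 describe a dynamic map (per-pixel depth difference between the two branches) and a gated state update $s_t = s_{t-1} + \alpha_t \odot \Delta s_t$. The NumPy sketch below illustrates that flow under stated assumptions and is not the authors' implementation. The function names, the max-normalization of the depth difference, the row-normalized `attn` matrix standing in for image-state attention, and the choice $\alpha_t = 1 - \text{(projected dynamic score)}$ are all hypothetical; the captions only state that the depth difference is projected via image-state attention to form the staticness weights.

```python
import numpy as np


def dynamic_map(depth_main, depth_raymap, eps=1e-6):
    """Per-pixel absolute depth difference between the two branches.

    Assumption: the difference is scaled by its maximum so values lie
    in [0, 1]; the source does not specify a normalization.
    """
    diff = np.abs(depth_main - depth_raymap)
    return diff / (diff.max() + eps)


def staticness_weights(dyn_map, attn):
    """Project a per-pixel dynamic map onto state tokens.

    attn: array of shape (num_state_tokens, num_pixels) whose rows sum
    to 1. It is a hypothetical stand-in for image-state attention.
    Returns alpha with shape (num_state_tokens, 1), values in [0, 1],
    where 1 means "static" and 0 means "dynamic" (an assumed convention).
    """
    dyn_per_token = attn @ dyn_map.reshape(-1)
    alpha = np.clip(1.0 - dyn_per_token, 0.0, 1.0)
    return alpha[:, None]


def gated_state_update(state_prev, delta_state, alpha):
    """Apply s_t = s_{t-1} + alpha_t * delta_s_t (elementwise product).

    alpha broadcasts across the feature dimension of each state token.
    """
    return state_prev + alpha * delta_state
```

In this sketch, a state token that attends mostly to pixels where the two branches disagree receives a weight near 0, so its memory entry is left nearly unchanged, while tokens attending to consistent (static) pixels receive the full update.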
