Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory

Yuqi Wu, Wenzhao Zheng, Jie Zhou, Jiwen Lu

TL;DR

Point3R introduces an online streaming framework for dense 3D reconstruction using an explicit spatial pointer memory that binds memory to 3D coordinates. It adds a 3D hierarchical position embedding and a memory fusion mechanism to enable efficient, scalable integration of new frames into a growing global coordinate system. The approach demonstrates competitive or state-of-the-art performance across dense reconstruction, monocular/video depth estimation, and camera pose tasks with low training cost, and shows robustness to long sequences and unordered inputs. Ablation studies validate the contributions of the pointer memory, 3D position embedding, and fusion strategy. This work offers a practical, interpretable memory mechanism for online 3D scene understanding in dynamic environments.

Abstract

Dense 3D scene reconstruction from an ordered sequence or unordered image collections is a critical step when bringing research in computer vision into practical scenarios. Following the paradigm introduced by DUSt3R, which unifies an image pair densely into a shared coordinate system, subsequent methods maintain an implicit memory to achieve dense 3D reconstruction from more images. However, such implicit memory is limited in capacity and may suffer from information loss of earlier frames. We propose Point3R, an online framework targeting dense streaming 3D reconstruction. To be specific, we maintain an explicit spatial pointer memory directly associated with the 3D structure of the current scene. Each pointer in this memory is assigned a specific 3D position and aggregates scene information nearby in the global coordinate system into a changing spatial feature. Information extracted from the latest frame interacts explicitly with this pointer memory, enabling dense integration of the current observation into the global coordinate system. We design a 3D hierarchical position embedding to promote this interaction and design a simple yet effective fusion mechanism to ensure that our pointer memory is uniform and efficient. Our method achieves competitive or state-of-the-art performance on various tasks with low training costs. Code: https://github.com/YkiWu/Point3R.
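
To make the idea of an explicit pointer memory more concrete, the following is a minimal sketch of how pointers tied to 3D positions might be fused as new frames arrive. It is not the authors' implementation: the class name `PointerMemory`, the `merge_radius` threshold, the brute-force nearest-neighbor search, and the overwrite-on-merge rule are all assumptions chosen for illustration. The abstract only states that each pointer has a 3D position and a changing spatial feature, and that fusion keeps the memory uniform and efficient.

```python
from dataclasses import dataclass, field
from math import dist


@dataclass
class PointerMemory:
    """Toy explicit memory: each pointer pairs a 3D position with a feature vector."""

    merge_radius: float  # hypothetical distance threshold, not taken from the paper
    positions: list = field(default_factory=list)
    features: list = field(default_factory=list)

    def _nearest(self, pos):
        """Return the index of the stored pointer closest to pos, or None if empty."""
        if not self.positions:
            return None
        return min(range(len(self.positions)),
                   key=lambda i: dist(self.positions[i], pos))

    def fuse(self, new_positions, new_features):
        """Integrate pointers produced from the latest frame into the memory."""
        for pos, feat in zip(new_positions, new_features):
            nearest = self._nearest(pos)
            if nearest is not None and dist(self.positions[nearest], pos) < self.merge_radius:
                # Assumed rule: a nearby pointer already covers this region,
                # so refresh it with the newer observation instead of growing.
                self.positions[nearest] = tuple(pos)
                self.features[nearest] = list(feat)
            else:
                # Unseen region: add a new pointer.
                self.positions.append(tuple(pos))
                self.features.append(list(feat))


memory = PointerMemory(merge_radius=0.1)
memory.fuse([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [[1.0], [2.0]])
memory.fuse([(0.05, 0.0, 0.0), (2.0, 0.0, 0.0)], [[3.0], [4.0]])
print(len(memory.positions))  # 3: one new pointer merged, one appended
```

In this sketch the memory grows only when a frame observes a region farther than `merge_radius` from every stored pointer, which is one simple way to keep the pointer count bounded by scene extent rather than by sequence length.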

Paper Structure

This paper contains 20 sections, 16 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: Comparison between our explicit spatial pointer memory and other paradigms in dense 3D reconstruction. Methods that conduct all-to-all interaction among all inputs simultaneously [yang2025fast3r, wang2025vggtvisualgeometrygrounded] can be considered as using other frames as memory (for one of the inputs). Methods that cache encoded features of processed frames and conduct token-image interaction [wang20243d] can be considered as using past frames as memory. Methods that maintain a fixed-length state memory and conduct state-image interaction [cut3r] can be considered as using implicit state memory. We propose an explicit spatial pointer memory in which each pointer is assigned a 3D position and points to a changing spatial feature. We conduct a pointer-image interaction to integrate new observations into the global coordinate system and update our spatial pointer memory accordingly.
  • Figure 2: Overview of Point3R. Given streaming image inputs, our method maintains an explicit spatial pointer memory to store the observed information of the current scene. We use a ViT encoder [dosovitskiy2021an, wang2024dust3r] to encode the current input into image tokens and use ViT-based decoders to conduct interaction between image tokens and spatial features in the memory. We use two DPT heads [ranftl21dpt] to decode local and global pointmaps from the output image tokens. In addition, a learnable pose token is added during this stage so that we can directly decode the camera parameters of the current frame. We then use a simple memory encoder to encode the current input and its integrated output into new pointers, and use a memory fusion mechanism to enrich and update our spatial pointer memory.
  • Figure 3: Qualitative results on sparse inputs from the 7-Scenes and NRGBD datasets. Our method achieves the best qualitative results among memory-based methods.
  • Figure 4: Changes in the total number of pointers and per-frame runtime with memory fusion.
  • Figure 5: Qualitative results on dense inputs from static scenes.
  • ...and 1 more figure