MeMix: Writing Less, Remembering More for Streaming 3D Reconstruction

Jiacheng Dong, Huan Li, Sicheng Zhou, Wenhao Hu, Weili Xu, Yan Wang

Abstract

Reconstruction is a fundamental task in 3D vision and a core capability for spatial intelligence. In particular, streaming 3D reconstruction is central to real-time spatial perception, yet existing recurrent online models often degrade progressively on long sequences because of state drift and forgetting, which motivates inference-time remedies. We present MeMix, a training-free, plug-and-play module that improves streaming reconstruction by recasting the recurrent state as a Memory Mixture. MeMix partitions the state into multiple independent memory patches and updates only the least-aligned patches while exactly preserving the others. This selective update mitigates catastrophic forgetting while retaining $O(1)$ inference memory, and it requires no fine-tuning or additional learnable parameters, making it directly applicable to existing recurrent reconstruction models. Across standard benchmarks (ScanNet, 7-Scenes, KITTI, etc.), under identical backbones and inference settings, MeMix reduces reconstruction completeness error by 15.3% on average (up to 40.0%) on 300--500-frame streams from 7-Scenes. The code is available at https://dongjiacheng06.github.io/MeMix/

Paper Structure (13 sections, 20 equations, 9 figures, 7 tables)


Figures (9)

  • Figure 1: MeMix. A training-free, plug-and-play state-update module for recurrent streaming 3D reconstruction. MeMix recasts the recurrent state as a mixture of memory patches, updates only the Bottom-k patches, and preserves the rest. This reduces interference and improves long-horizon stability with $O(1)$ inference memory.
  • Figure 2: Where to write: Mixture Memory Updates. (a) CUT3R overwrites all state tokens at every timestep. (b) TTT3R applies a dense per-token gate to modulate how much to write, but still updates every token. (c--d) MeMix enables where-to-write updates via Mixture Memory: only a subset of memory patches/tokens are written while the rest are exactly preserved, and it can be plugged into CUT3R (c) or combined with TTT-style gating (d). Colored token squares $\blacksquare\!\blacksquare\!\blacksquare$ indicate tokens that are progressively reinforced over time.
  • Figure 3: Overview of MeMix. A ViT encoder encodes each frame into tokens $\mathbf{X}_t$, which interact with the state tokens $\mathbf{S}_{t-1}$ through a dual-stream cross-attention decoder to produce predictions $\mathbf{Y}_t$ and a candidate state $\hat{\mathbf{S}}_t$. MeMix computes dot-product scores between $\hat{\mathbf{S}}_t$ and $\mathbf{X}_t$, selects the Bottom-k patches to construct a binary mask $\mathbf{M}_t$, and updates only those Bottom-k patches. The decoded image tokens $\mathbf{Y}_t$ are fed to the prediction head for output.
  • Figure 4: Qualitative results of 3D reconstruction. We compare CUT3R, TTT3R, and TTSA3R with their MeMix variants on long input streams. MeMix consistently improves reconstruction quality by reducing drift and recovering more complete, sharper surfaces. Red boxes highlight representative regions where MeMix corrects failures such as surface tearing, missing geometry, and ghosting.
  • Figure 5: Long-sequence pose estimation on TUM and ScanNet. We compare CUT3R, TTT3R, and TTSA3R with their MeMix variants on long input streams, and report the absolute trajectory error (ATE) as the number of input views increases.
  • ...and 4 more figures
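
The selective update described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each state token is its own memory patch, that a patch's score is the mean dot product between its candidate token and all frame tokens (the captions do not say how scores are aggregated), and that the mask is applied as $\mathbf{S}_t = \mathbf{M}_t \odot \hat{\mathbf{S}}_t + (1 - \mathbf{M}_t) \odot \mathbf{S}_{t-1}$. The function names `alignment_scores`, `bottom_k_mask`, and `memix_update` are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def alignment_scores(cand_state, frame_tokens):
    """Score each candidate state token against the current frame.

    Assumption: the score is the mean dot product with all frame tokens.
    """
    n = len(frame_tokens)
    return [sum(dot(s, x) for x in frame_tokens) / n for s in cand_state]


def bottom_k_mask(scores, k):
    """Binary mask with 1 at the k lowest-scoring (least-aligned) positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    chosen = set(order[:k])
    return [1 if i in chosen else 0 for i in range(len(scores))]


def memix_update(prev_state, cand_state, frame_tokens, k):
    """Write only the Bottom-k patches; copy every other patch unchanged."""
    scores = alignment_scores(cand_state, frame_tokens)
    mask = bottom_k_mask(scores, k)
    new_state = [
        list(cand) if m else list(prev)
        for prev, cand, m in zip(prev_state, cand_state, mask)
    ]
    return new_state, mask


# Toy example: 4 state tokens of dimension 2, 2 frame tokens, k = 2.
prev_state = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
cand_state = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0], [-2.0, 0.0]]
frame_tokens = [[1.0, 0.0], [1.0, 1.0]]

new_state, mask = memix_update(prev_state, cand_state, frame_tokens, k=2)
print(mask)       # tokens 1 and 3 have the lowest scores and are written
print(new_state)  # tokens 0 and 2 are identical to prev_state
```

In this toy run the scores are 2, 1, 3, and -2, so tokens 1 and 3 are overwritten while tokens 0 and 2 are carried over exactly. The state size never grows, which is consistent with the $O(1)$ inference-memory claim.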