LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory

Junyi Zhang, Charles Herrmann, Junhwa Hur, Chen Sun, Ming-Hsuan Yang, Forrester Cole, Trevor Darrell, Deqing Sun

TL;DR

This work presents LoGeR (Long-context Geometric Reconstruction), a novel architecture that scales dense 3D reconstruction to extremely long sequences without post-optimization, and achieves robust, globally consistent reconstruction over unprecedented horizons.

Abstract

Feedforward geometric foundation models achieve strong short-window reconstruction, yet scaling them to minutes-long videos is bottlenecked by quadratic attention complexity or limited effective memory in recurrent designs. We present LoGeR (Long-context Geometric Reconstruction), a novel architecture that scales dense 3D reconstruction to extremely long sequences without post-optimization. LoGeR processes video streams in chunks, leveraging strong bidirectional priors for high-fidelity intra-chunk reasoning. To manage the critical challenge of coherence across chunk boundaries, we propose a learning-based hybrid memory module. This dual-component system combines a parametric Test-Time Training (TTT) memory, which anchors the global coordinate frame and prevents scale drift, with a non-parametric Sliding Window Attention (SWA) mechanism, which preserves uncompressed context for high-precision alignment of adjacent chunks. Remarkably, this memory architecture enables LoGeR to be trained on sequences of 128 frames and to generalize to thousands of frames during inference. Evaluated across standard benchmarks and a newly repurposed VBR dataset with sequences of up to 19k frames, LoGeR substantially outperforms prior state-of-the-art feedforward methods, reducing ATE on KITTI by over 74%, and achieves robust, globally consistent reconstruction over unprecedented horizons.
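
The chunk-plus-sliding-window idea in the abstract can be illustrated with some frame-level bookkeeping. The sketch below is not the paper's implementation: the function names, the chunk size, and the choice of a window measured in whole previous chunks are all assumptions made for illustration, and it counts frame pairs rather than attention tokens. It only shows why restricting each chunk to itself plus a bounded window avoids the quadratic cost of full attention.

```python
from typing import List


def split_into_chunks(num_frames: int, chunk_size: int) -> List[range]:
    """Split frame indices 0..num_frames-1 into consecutive chunks.

    The last chunk may be shorter than chunk_size.
    """
    return [
        range(start, min(start + chunk_size, num_frames))
        for start in range(0, num_frames, chunk_size)
    ]


def swa_context(chunks: List[range], i: int, window: int) -> List[int]:
    """Frames visible to chunk i under a sliding window (hypothetical rule).

    Chunk i sees all of its own frames (bidirectional, intra-chunk) plus
    the frames of the `window` chunks immediately before it.
    """
    first = max(0, i - window)
    return [frame for chunk in chunks[first : i + 1] for frame in chunk]


def attention_pairs(num_frames: int, chunk_size: int, window: int) -> tuple:
    """Return (full, chunked) counts of frame-to-frame attention pairs."""
    full = num_frames * num_frames  # every frame attends to every frame
    chunks = split_into_chunks(num_frames, chunk_size)
    chunked = sum(
        len(chunk) * len(swa_context(chunks, i, window))
        for i, chunk in enumerate(chunks)
    )
    return full, chunked


if __name__ == "__main__":
    # 19,000 frames matches the longest VBR sequence mentioned above;
    # chunk_size=64 and window=1 are illustrative values only.
    full, chunked = attention_pairs(19_000, chunk_size=64, window=1)
    print(f"full attention pairs:    {full:,}")
    print(f"chunked + window pairs:  {chunked:,}")
```

Under these assumed settings the chunked count grows linearly with the number of frames, since each chunk's context is bounded by `(window + 1) * chunk_size` frames. Information from earlier chunks that falls outside the window has to be carried some other way, which is the role the abstract assigns to the TTT memory.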

Paper Structure

This paper contains 24 sections, 12 equations, 14 figures, 6 tables.

Figures (14)

  • Figure 1: Our proposed method and visual comparison. For very long videos, we advocate chunk-based processing in which bidirectional attention handles intra-chunk reasoning and our proposed hybrid memory module handles inter-chunk alignment, combining Sliding Window Attention (SWA) for detailed local memory with Test-Time Training (TTT) for compressed global context. LoGeR shows clear improvement over prior methods on the expansive VBR dataset [brizi2024vbr], yielding superior loop closures and finer geometric details.
  • Figure 2: Overview of a single block of our hybrid memory module. We process the input sequence in consecutive chunks of frames. While each block utilizes frame and bidirectional attention from prior work, we introduce new components to effectively propagate information across the entire sequence. Specifically, we incorporate Sliding Window Attention to improve consistency between neighboring chunks, and Test-Time Training layers to maintain long-range, global consistency across all chunks.
  • Figure 3: Comparison of different methods across varying sequence lengths and scene scales. Although FastVGGT is able to process a larger number of frames during inference, it fails completely on large-scale scenes, highlighting the inherent "data wall" of models trained strictly on short-context bubbles. In contrast, LoGeR breaks both the context and data walls by pairing a hybrid memory architecture with diverse long-horizon training data.
  • Figure 4: Quantitative results on our proposed VBR [brizi2024vbr] evaluation, covering very long sequences of 1,000 to 19,000 frames. Our methods achieve 30.8% more accurate results than prior methods.
  • Figure 5: Qualitative camera trajectories on the VBR dataset. LoGeR accurately preserves global scale and trajectory over very long sequences, closely matching the ground truth where prior methods suffer from severe drift.
  • ...and 9 more figures
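
The Figure 2 caption describes Test-Time Training layers that carry compressed global context across chunks. The excerpt does not give the update rule, so the following is only a generic sketch of the fast-weight idea commonly associated with test-time training: a small weight matrix is updated by one gradient step on a reconstruction loss as data streams in, then queried. The linear form, the loss `0.5 * ||W k - v||^2`, the learning rate, and all function names are assumptions for illustration, not LoGeR's actual layer.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def ttt_update(W: Matrix, key: Vector, value: Vector, lr: float) -> Matrix:
    """One gradient step on 0.5 * ||W k - v||^2 with respect to W.

    The gradient is the outer product (W k - v) k^T, so the update is
    W <- W - lr * (W k - v) k^T. This is an assumed, generic rule.
    """
    err = [p - v for p, v in zip(matvec(W, key), value)]
    return [
        [w - lr * e * k for w, k in zip(row, key)]
        for row, e in zip(W, err)
    ]


def ttt_read(W: Matrix, query: Vector) -> Vector:
    """Read from the memory by applying the current fast weights."""
    return matvec(W, query)


if __name__ == "__main__":
    # Start from an empty 2x2 memory and write one (key, value) pair.
    W = [[0.0, 0.0], [0.0, 0.0]]
    W = ttt_update(W, key=[1.0, 0.0], value=[2.0, 3.0], lr=1.0)
    print(ttt_read(W, [1.0, 0.0]))
```

The memory in this sketch has a fixed size regardless of how many pairs are written, which is what makes a parametric memory suitable for summarizing arbitrarily long history, at the cost of lossy compression. That trade-off is consistent with the caption's pairing of TTT (global, compressed) with SWA (local, uncompressed).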