
Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction

Tao Xie, Peishan Yang, Yudong Jin, Yingfeng Cai, Wei Yin, Weiqiang Ren, Qian Zhang, Wei Hua, Sida Peng, Xiaoyang Guo, Xiaowei Zhou

Abstract

This paper addresses the task of large-scale 3D scene reconstruction from long video sequences. Recent feed-forward reconstruction models have shown promising results by directly regressing 3D geometry from RGB images without explicit 3D priors or geometric constraints. However, these methods often struggle to maintain reconstruction accuracy and consistency over long sequences due to limited memory capacity and the inability to effectively capture global contextual cues. In contrast, humans naturally exploit a global understanding of the scene to inform local perception. Motivated by this, we propose a novel neural global context representation that efficiently compresses and retains long-range scene information, enabling the model to leverage extensive contextual cues for enhanced reconstruction accuracy and consistency. The context representation is realized through a set of lightweight neural sub-networks that are rapidly adapted at test time via self-supervised objectives, which substantially increases memory capacity without incurring significant computational overhead. Experiments on multiple large-scale benchmarks, including the KITTI Odometry~\cite{Geiger2012CVPR} and Oxford Spires~\cite{tao2025spires} datasets, demonstrate the effectiveness of our approach in handling ultra-large scenes, achieving leading pose accuracy and state-of-the-art 3D reconstruction accuracy while maintaining efficiency. Code is available at https://zju3dv.github.io/scal3r.
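The core idea in the abstract, lightweight sub-networks adapted at test time with a self-supervised objective, can be illustrated with a minimal sketch. The sub-network below is just a linear map and the objective is a simple self-reconstruction loss; both are illustrative assumptions, not the paper's actual architecture or losses.

```python
import numpy as np

def adapt_context(W, feats, lr=0.1, steps=50):
    """Hypothetical test-time adaptation: take SGD steps on a
    self-supervised reconstruction loss ||W @ feats - feats||^2 so the
    lightweight 'context' map W fits the incoming features on the fly."""
    for _ in range(steps):
        pred = W @ feats                                   # compress-and-predict
        grad = 2.0 * (pred - feats) @ feats.T / feats.shape[1]
        W -= lr * grad                                     # one SGD step at test time
    return W

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 32))   # toy per-chunk features (assumed shape)
W = np.eye(4) * 0.1                # tiny "sub-network" far from a good fit
W = adapt_context(W, feats)
err = np.mean((W @ feats - feats) ** 2)
```

After a few dozen steps the residual error drops by orders of magnitude, which is the sense in which a small adapted network can act as extra, scene-specific memory without a large fixed parameter budget.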


Paper Structure

This paper contains 23 sections, 15 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Overview of Scal3R. Our model takes a long sequence of RGB images as input and reconstructs the 3D scene within a unified inference pipeline. Specifically, the input sequence is divided into overlapping chunks that are processed in parallel across multiple GPUs. Each chunk is processed by our Scal3R backbone, which incorporates our proposed neural global context representation and aggregation mechanism to capture and share global context across the entire sequence. The resulting camera poses and depth maps from all chunks are then aligned and fused to generate the final 3D reconstruction of the scene.
  • Figure 2: Camera trajectory comparison on KITTI Odometry~\cite{Geiger2012CVPR} and Oxford Spires~\cite{tao2025spires}. Scal3R preserves global structure with lower drift, whereas baselines often lose tracking or diverge on long sequences.
  • Figure 3: Qualitative comparison of point-cloud reconstruction on outdoor and indoor scenes. Scal3R reconstructs large-scale outdoor scenes more reliably and preserves more consistent local geometry indoors.
  • Figure 4: Camera trajectory comparison. Scal3R preserves global structure with substantially lower drift, whereas baselines frequently lose tracking or diverge, demonstrating our capability of reconstructing large-scale scenarios with high accuracy.
  • Figure 5: Point-cloud reconstruction comparison. Scal3R produces more accurate large-scale reconstructions for large-scale outdoor environments where baselines often fail, and achieves higher local geometric accuracy and consistency in indoor scenes.
  • ...and 1 more figure
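The chunked pipeline described in the Figure 1 caption, splitting a long sequence into overlapping chunks, reconstructing each independently, then aligning and fusing the results, can be sketched as follows. The chunk size, overlap length, and the per-chunk rigid-offset model are assumptions made for illustration; the paper's actual alignment is not specified in this excerpt.

```python
import numpy as np

def make_chunks(n, size, overlap):
    """Split frame indices [0, n) into overlapping chunks of `size`
    frames, where consecutive chunks share `overlap` frames."""
    step = size - overlap
    return [(s, min(s + size, n)) for s in range(0, n - overlap, step)]

rng = np.random.default_rng(1)
traj = np.cumsum(rng.normal(size=(100, 3)), axis=0)  # ground-truth camera positions

size, overlap = 30, 10
chunks = make_chunks(len(traj), size, overlap)

# Simulate independent per-chunk reconstructions: each chunk recovers the
# trajectory up to an unknown global translation (a toy stand-in for the
# per-chunk ambiguity a feed-forward model would leave unresolved).
recons = [traj[s:e] + rng.normal(size=3) for s, e in chunks]

# Fuse: align each chunk to the running trajectory using the mean offset
# over the frames shared with what has already been fused.
fused = recons[0].copy()
for (s, e), rec in zip(chunks[1:], recons[1:]):
    n_shared = len(fused) - s                       # frames present in both
    shift = fused[s:].mean(axis=0) - rec[:n_shared].mean(axis=0)
    fused = np.vstack([fused, rec[n_shared:] + shift])
```

Because each chunk only needs to agree with its neighbor on the overlap, the chunks can be reconstructed in parallel (e.g. across GPUs, as the caption describes) and stitched afterwards in a single linear pass.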