TTT3R: 3D Reconstruction as Test-Time Training
Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, Anpei Chen
TL;DR
TTT3R reframes long-sequence 3D reconstruction as a test-time online learning problem and derives a confidence-guided, per-token learning rate for updating a fixed-size memory state, with no fine-tuning required. Applied to CUT3R as a plug-and-play modification, it improves length generalization while keeping inference real-time and memory-efficient. On camera pose, video depth, and 3D reconstruction benchmarks, TTT3R matches or approaches offline full-attention methods while running online, and it remains robust on long sequences. Limitations include residual forgetting on very long sequences and a remaining gap to offline baselines; the appendix proposes State Reset as a mitigation, and further exploration of test-time learning is suggested as future work.
Abstract
Modern Recurrent Neural Networks have become a competitive architecture for 3D reconstruction due to their linear-time complexity. However, their performance degrades significantly when applied beyond the training context length, revealing limited length generalization. In this work, we revisit 3D reconstruction foundation models from a Test-Time Training perspective, framing their designs as an online learning problem. Building on this perspective, we leverage the alignment confidence between the memory state and incoming observations to derive a closed-form learning rate for memory updates, balancing the retention of historical information against adaptation to new observations. This training-free intervention, termed TTT3R, substantially improves length generalization, achieving a $2\times$ improvement in global pose estimation over baselines, while operating at 20 FPS with just 6 GB of GPU memory to process thousands of images. Code is available at https://rover-xingyu.github.io/TTT3R
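
To make the idea of a confidence-guided memory update concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the memory state and the incoming observation are both sets of tokens, that "alignment confidence" can be approximated by the sigmoid of the mean query-key score per state token, and that the update interpolates each state token toward a cross-attention readout. The function name `gated_state_update`, the projection `w_q`, and the specific gating formula are all hypothetical choices made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_state_update(state, obs_keys, obs_values, w_q):
    """Illustrative per-token gated memory update (assumed form, not TTT3R's exact rule).

    state:      (n, d) memory tokens carried across frames
    obs_keys:   (m, d) keys computed from the incoming observation
    obs_values: (m, d) values computed from the incoming observation
    w_q:        (d, d) hypothetical query projection applied to the state
    """
    d = state.shape[1]
    queries = state @ w_q                           # (n, d)
    logits = queries @ obs_keys.T / np.sqrt(d)      # (n, m) alignment scores

    # Softmax cross-attention readout: what each state token would become
    # if it were overwritten by the new observation.
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    candidate = attn @ obs_values                   # (n, d)

    # Assumed confidence gate: one learning rate per state token, in (0, 1).
    beta = sigmoid(logits.mean(axis=1, keepdims=True))  # (n, 1)

    # beta near 0 retains the old token; beta near 1 adopts the candidate.
    new_state = state + beta * (candidate - state)
    return new_state, beta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    state = rng.normal(size=(8, 16))
    keys = rng.normal(size=(5, 16))
    values = rng.normal(size=(5, 16))
    new_state, beta = gated_state_update(state, keys, values, np.eye(16))
    print(new_state.shape, beta.shape)
```

Under these assumptions, a state token that aligns poorly with the new observation receives a small learning rate and is mostly preserved, while a well-aligned token moves further toward the new readout. The state keeps the same shape at every step, which is what keeps memory use constant as the sequence grows.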
