TTT3R: 3D Reconstruction as Test-Time Training

Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, Anpei Chen

TL;DR

TTT3R reframes long-sequence 3D reconstruction as a test-time online learning problem and derives a confidence-guided, per-token learning rate for updating a fixed-size memory state without fine-tuning. As a plug-and-play modification, it improves the length generalization of CUT3R, delivering strong online performance on pose, depth, and geometry tasks while keeping inference real-time and memory-efficient. Across camera pose, video depth, and 3D reconstruction benchmarks, TTT3R matches or approaches offline full-attention methods in online settings and remains robust on long sequences. Limitations include residual forgetting on very long sequences and a remaining gap to offline baselines; the appendix proposes State Reset as a mitigation, and further exploration of test-time learning is suggested as future work.

Abstract

Modern Recurrent Neural Networks have become a competitive architecture for 3D reconstruction due to their linear-time complexity. However, their performance degrades significantly when applied beyond the training context length, revealing limited length generalization. In this work, we revisit 3D reconstruction foundation models from a Test-Time Training perspective, framing their designs as an online learning problem. Building on this perspective, we leverage the alignment confidence between the memory state and incoming observations to derive a closed-form learning rate for memory updates, balancing the retention of historical information against adaptation to new observations. This training-free intervention, termed TTT3R, substantially improves length generalization, achieving a $2\times$ improvement in global pose estimation over baselines, while operating at 20 FPS with just 6 GB of GPU memory to process thousands of images. Code is available at https://rover-xingyu.github.io/TTT3R
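
The trade-off between retaining history and adapting to new observations can be pictured as one gradient-descent step on the memory state. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function name `update_state`, the `candidate_state` argument, and the squared-error objective are hypothetical choices made for illustration. Only the general form, "state minus learning rate times gradient", follows the paper's description.

```python
import numpy as np


def update_state(prev_state, candidate_state, beta):
    """Take one gradient-descent step on the memory state.

    prev_state:      (n, d) array, one row per state token (S_{t-1}).
    candidate_state: (n, d) array, the state suggested by the new observation
                     (hypothetical input, assumed for this sketch).
    beta:            (n, 1) array of per-token learning rates in [0, 1].
    """
    # ASSUMPTION: the objective is 0.5 * ||S - candidate_state||^2,
    # whose gradient at prev_state is (prev_state - candidate_state).
    grad = prev_state - candidate_state
    # beta = 0 keeps the old token unchanged; beta = 1 overwrites it.
    return prev_state - beta * grad


prev = np.zeros((3, 4))
cand = np.ones((3, 4))
beta = np.array([[0.0], [0.25], [1.0]])
new_state = update_state(prev, cand, beta)
```

In this toy call, the first state token is left untouched, the second moves a quarter of the way toward the candidate, and the third is fully replaced.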

Paper Structure

This paper contains 20 sections, 8 equations, 16 figures, 2 tables.

Figures (16)

  • Figure 1: Left: CUT3R encodes observations into a state (memory) $\mathbf{S}_{t-1}$, which then interacts with the new observation $\mathbf{X}_t$; 3D information is retrieved by reading out the output token $\mathbf{Y}_t$. However, it suffers from the forgetting problem and degrades significantly as the number of input views increases. Right: We treat the state $\mathbf{S}_{t}$ as a fast weight updated via gradient descent, where the learning rate $\beta_t$ and the gradient $\nabla_t$ are predicted by the frozen slow weights. These slow weights are learned from training datasets and act as a meta-learner, enabling the fast weight to serve as an associative memory. In addition, TTT3R makes online state updates by balancing the retention of historical information ${\mathbf{S}_{t-1}}$ with a confidence-aware learning rate $\beta_t$. This visualization also incorporates a state reset process; see the State Reset subsection for details.
  • Figure 2: GPU memory cost for inference.
  • Figure 3: Sequence Modeling Layers. Full attention appends states, which incurs a quadratic cost. In contrast, vanilla RNNs use a fixed-size state with linear complexity, but they suffer from the forgetting problem. Our approach adopts Test-Time Training (TTT), treating the state as fast weights learned during test time via gradient descent with adaptive learning rates, which improves length generalization.
  • Figure 4: TTT3R Illustration. We present a training-free solution for scalable online 3D reconstruction that mitigates the forgetting issue in CUT3R. (a) Vanilla CUT3R pipeline. (b) Our reformulation from a test-time training perspective introduces a confidence-guided state update, where the alignment confidence between memory and observations serves as per-token learning rates. See the TTT3R update equation for more details.
  • Figure 5: By incorporating image attention (i.e., $\mathbf{Q}_{\mathbf{S}_{t-1}} {\mathbf{K}^{\top}_{\mathbf{X}_t}}\in\mathbb{R}^{n\times(h\times w)}$) as per-token learning rates $\beta_t\in \mathbb{R}^{n \times 1}$, TTT3R mitigates catastrophic forgetting and facilitates online loop closure.
  • ...and 11 more figures
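
The Figure 5 caption describes turning state-to-image attention of shape $n\times(h\times w)$ into per-token learning rates $\beta_t\in\mathbb{R}^{n\times 1}$. A rough NumPy sketch of that shape reduction follows. The caption does not say how the attention map is reduced, so the mean over image tokens followed by a sigmoid is an assumption here, and the name `per_token_learning_rate` is hypothetical.

```python
import numpy as np


def per_token_learning_rate(state_queries, image_keys):
    """Reduce state-to-image attention logits to one rate per state token.

    state_queries: (n, d) queries projected from the n state tokens.
    image_keys:    (h*w, d) keys projected from the image tokens.
    Returns an (n, 1) array with values strictly between 0 and 1.
    """
    logits = state_queries @ image_keys.T        # shape (n, h*w)
    # ASSUMPTION: pool over image tokens with a mean, then apply a sigmoid.
    pooled = logits.mean(axis=1, keepdims=True)  # shape (n, 1)
    return 1.0 / (1.0 + np.exp(-pooled))


n, hw, d = 4, 6, 8
neutral = per_token_learning_rate(np.zeros((n, d)), np.ones((hw, d)))
aligned = per_token_learning_rate(np.ones((n, d)), np.ones((hw, d)))
```

Under this assumed reduction, state tokens whose queries align with the incoming image keys receive a rate above 0.5 and are updated more strongly, while tokens with no alignment sit at 0.5.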