
tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction

Chen Wang, Hao Tan, Wang Yifan, Zhiqin Chen, Yuheng Liu, Kalyan Sunkavalli, Sai Bi, Lingjie Liu, Yiwei Hu

TL;DR

This work addresses the challenge of long-context, autoregressive 3D reconstruction from streaming images. It introduces tttLRM, which uses Test-Time Training and Large Chunk Test-Time Training (LaCT) blocks to compress multiple views into fast weights that are decoded into explicit 3D representations such as 3D Gaussian Splats and triplane NeRF features. The model supports both feedforward long-context reconstruction and autoregressive streaming reconstruction with linear computational complexity, leveraging pretrained novel-view synthesis models for initialization. Experiments on object and scene benchmarks show strong reconstruction quality and scalability, approaching the speed of explicit 3D representations while offering the flexibility to output multiple explicit formats for downstream tasks.

Abstract

We propose tttLRM, a novel large 3D reconstruction model that leverages a Test-Time Training (TTT) layer to enable long-context, autoregressive 3D reconstruction with linear computational complexity, further scaling the model's capability. Our framework efficiently compresses multiple image observations into the fast weights of the TTT layer, forming an implicit 3D representation in the latent space that can be decoded into various explicit formats, such as Gaussian Splats (GS) for downstream applications. The online learning variant of our model supports progressive 3D reconstruction and refinement from streaming observations. We demonstrate that pretraining on novel view synthesis tasks effectively transfers to explicit 3D modeling, resulting in improved reconstruction quality and faster convergence. Extensive experiments show that our method achieves superior performance in feedforward 3D Gaussian reconstruction compared to state-of-the-art approaches on both objects and scenes.
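The compression of observations into fast weights can be illustrated with a toy example. The sketch below is not the paper's LaCT block: it assumes a purely linear fast-weight map, a squared reconstruction loss, and plain gradient descent, none of which are specified in this summary. All names (`update_fast_weights`, `compress_and_query`, `lr`) are hypothetical. It only shows the general pattern: each chunk of tokens triggers one weight update, and separate query tokens read from the updated weights, so the cost grows linearly with the number of tokens.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Chunk = Tuple[List[Vector], List[Vector]]  # (keys, values) for one chunk of tokens


def matvec(w: Matrix, x: Vector) -> Vector:
    """Return w @ x for a row-major matrix."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def update_fast_weights(w: Matrix, keys: List[Vector], values: List[Vector],
                        lr: float) -> Matrix:
    """One gradient step on the chunk loss sum_t 0.5 * ||w k_t - v_t||^2.

    Assumption: a linear fast weight and squared loss stand in for whatever
    the actual model uses. The gradient is accumulated over the whole chunk
    before the weights change.
    """
    rows, cols = len(w), len(w[0])
    grad = [[0.0] * cols for _ in range(rows)]
    for k, v in zip(keys, values):
        pred = matvec(w, k)
        for i in range(rows):
            err = pred[i] - v[i]
            for j in range(cols):
                grad[i][j] += err * k[j]
    return [[w[i][j] - lr * grad[i][j] for j in range(cols)] for i in range(rows)]


def compress_and_query(chunks: List[Chunk], queries: List[Vector],
                       dim_out: int, dim_in: int, lr: float):
    """Stream chunks into the fast weights, then read them out with queries."""
    w = [[0.0] * dim_in for _ in range(dim_out)]
    for keys, values in chunks:  # one update per chunk, in arrival order
        w = update_fast_weights(w, keys, values, lr)
    return w, [matvec(w, q) for q in queries]


if __name__ == "__main__":
    chunks = [([[1.0, 0.0]], [[1.0, 2.0]]), ([[0.0, 1.0]], [[3.0, 4.0]])]
    w, outputs = compress_and_query(chunks, [[1.0, 0.0], [1.0, 1.0]], 2, 2, lr=1.0)
    print(w, outputs)
```

In this toy setting the keys are orthonormal, so a single step with `lr=1.0` stores each value exactly; with real image tokens the stored content would only be an approximation.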

Paper Structure (21 sections, 5 equations, 8 figures, 7 tables, 1 algorithm)

Figures (8)

  • Figure 1: We propose tttLRM, a Large Reconstruction Model based on Test-Time Training, enabling high-resolution, long-context, autoregressive 3D reconstruction. Our model 1) achieves high-resolution (1024px) single-image-to-3D reconstruction via a multi-view generator, 2) performs long-context (64 input views), feedforward 3DGS reconstruction, and 3) supports autoregressive streaming reconstruction.
  • Figure 2: Given a set of posed input images, tttLRM patchifies them and encodes the patches into tokens (green boxes). The input tokens are fed into the LaCT block (shown in the blue frame), where the fast weights are updated accordingly. A separate set of virtual tokens (blue boxes) then queries the updated fast weights, and the results are decoded into 3D representations such as 3DGS for high-quality novel view synthesis.
  • Figure 3: Illustration of distributed feedforward reconstruction training. First, image tokens are sharded across GPUs, and each GPU predicts Gaussians for its assigned virtual views after the fast weights are synchronized. The predicted Gaussians are then gathered to construct the full scene, after which each GPU renders a subset of novel views and computes its respective losses. Finally, gradients are all-reduced and backpropagated across all devices.
  • Figure 4: Qualitative comparison between our method and baseline approaches. Our model reconstructs the 3DGS scene with higher fidelity than both optimization-based and feedforward baselines, as also reflected in the PSNR metrics. Please zoom in for a better comparison.
  • Figure 5: We demonstrate that our high-resolution $1024 \times 1024$ 3DGS tttLRM can be effectively used for image-to-3D generation when combined with a multi-view generator. Our model enables the reconstruction of fine-grained, photorealistic details, e.g., hair, fur, and text, from the input images. Video results are provided in the supplemental material.
  • ...and 3 more figures
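The token sharding and fast-weight synchronization mentioned in the Figure 3 caption can be illustrated with a small single-process simulation. This is a sketch under stated assumptions, not the paper's training code: the caption does not say how the fast weights are synchronized, so the example assumes each worker computes a gradient on its own token shard and the gradients are summed (a simulated all-reduce) before one shared update. The linear fast weight, the squared loss, and all names (`local_gradient`, `all_reduce_sum`, `synced_update`) are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Token = Tuple[Vector, Vector]  # (key, value) pair standing in for one image token


def local_gradient(w: Matrix, shard: List[Token]) -> Matrix:
    """Gradient of sum 0.5 * ||w k - v||^2 over the tokens held by one worker."""
    grad = [[0.0] * len(w[0]) for _ in w]
    for key, value in shard:
        for i, row in enumerate(w):
            err = sum(a * b for a, b in zip(row, key)) - value[i]
            for j, k_j in enumerate(key):
                grad[i][j] += err * k_j
    return grad


def all_reduce_sum(grads: List[Matrix]) -> Matrix:
    """Simulated all-reduce: element-wise sum of every worker's gradient."""
    rows, cols = len(grads[0]), len(grads[0][0])
    return [[sum(g[i][j] for g in grads) for j in range(cols)] for i in range(rows)]


def synced_update(w: Matrix, shards: List[List[Token]], lr: float) -> Matrix:
    """Apply one update using the summed gradient, so all workers end up
    holding identical fast weights."""
    total = all_reduce_sum([local_gradient(w, shard) for shard in shards])
    return [[w_ij - lr * g_ij for w_ij, g_ij in zip(w_row, g_row)]
            for w_row, g_row in zip(w, total)]


if __name__ == "__main__":
    tokens = [([1.0, 0.0], [2.0]), ([0.0, 1.0], [4.0]),
              ([1.0, 1.0], [6.0]), ([2.0, 0.0], [4.0])]
    w0 = [[0.0, 0.0]]
    sharded = synced_update(w0, [tokens[:2], tokens[2:]], lr=0.1)
    single = synced_update(w0, [tokens], lr=0.1)
    print(sharded, single)
```

Because the assumed loss is a sum over tokens, the summed per-shard gradients equal the gradient over all tokens, so the sharded update matches the single-device update up to floating-point rounding.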