tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction
Chen Wang, Hao Tan, Wang Yifan, Zhiqin Chen, Yuheng Liu, Kalyan Sunkavalli, Sai Bi, Lingjie Liu, Yiwei Hu
TL;DR
This work addresses long-context, autoregressive 3D reconstruction from streaming images. It introduces tttLRM, which uses Test-Time Training (TTT) and Large Chunk Test-Time Training (LaCT) blocks to compress multiple views into fast weights that are then decoded into explicit 3D representations such as 3D Gaussian Splats and triplane NeRF features. The model supports both feedforward long-context reconstruction and autoregressive streaming reconstruction with linear computational complexity, and it is initialized from pretrained novel-view synthesis models. Experiments on object and scene benchmarks show strong reconstruction quality and scalability, approaching the speed of explicit 3D representations while offering the flexibility to output multiple explicit formats for downstream tasks.
Abstract
We propose tttLRM, a novel large 3D reconstruction model that leverages a Test-Time Training (TTT) layer to enable long-context, autoregressive 3D reconstruction with linear computational complexity, further scaling the model's capability. Our framework efficiently compresses multiple image observations into the fast weights of the TTT layer, forming an implicit 3D representation in the latent space that can be decoded into various explicit formats, such as Gaussian Splats (GS), for downstream applications. The online learning variant of our model supports progressive 3D reconstruction and refinement from streaming observations. We demonstrate that pretraining on novel-view synthesis tasks effectively transfers to explicit 3D modeling, resulting in improved reconstruction quality and faster convergence. Extensive experiments show that our method achieves superior performance in feedforward 3D Gaussian reconstruction compared to state-of-the-art approaches on both objects and scenes.
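To make the idea of compressing observations into fast weights more concrete, the toy sketch below shows one way a chunk-wise test-time-training update might look. It is not the authors' implementation: the single linear fast-weight matrix `W`, the squared-error chunk loss, plain gradient descent, the learning rate, and all function names (`ttt_update`, `ttt_query`, `stream_compress`, `make_toy_chunks`) are assumptions chosen for illustration. What it does show is the property the abstract relies on: each incoming chunk of tokens updates a fixed-size state once, so total cost grows linearly with the number of chunks and memory stays constant.

```python
import numpy as np


def ttt_update(W, K, V, lr=0.5):
    """One gradient step on the chunk loss 0.5 * mean_i ||K_i @ W - V_i||^2.

    W: (dim, dim) fast weights (assumed linear here, purely for illustration).
    K, V: (tokens, dim) key/value tokens of one chunk, e.g. from a few views.
    """
    n = K.shape[0]
    grad = K.T @ (K @ W - V) / n
    return W - lr * grad


def ttt_query(W, Q):
    """Read from the compressed state: apply the fast weights to query tokens."""
    return Q @ W


def stream_compress(chunks, dim, lr=0.5):
    """Process chunks one after another, as in a streaming setting.

    The state W has a fixed size, so memory does not grow with the number
    of chunks, and each chunk is visited exactly once (linear total cost).
    """
    W = np.zeros((dim, dim))
    for K, V in chunks:
        W = ttt_update(W, K, V, lr)
    return W


def make_toy_chunks(num_chunks, tokens_per_chunk, dim, seed=0):
    """Hypothetical synthetic data: values are a fixed linear map of the keys."""
    rng = np.random.default_rng(seed)
    W_true = rng.standard_normal((dim, dim))
    chunks = []
    for _ in range(num_chunks):
        K = rng.standard_normal((tokens_per_chunk, dim))
        chunks.append((K, K @ W_true))
    return chunks, W_true


if __name__ == "__main__":
    chunks, W_true = make_toy_chunks(num_chunks=20, tokens_per_chunk=64, dim=8)
    W = stream_compress(chunks, dim=8)
    rel_err = np.linalg.norm(W - W_true) / np.linalg.norm(W_true)
    print(f"relative error after streaming all chunks: {rel_err:.4f}")
```

In this toy setting the relative error shrinks as more chunks arrive, which loosely mirrors the progressive refinement from streaming observations described above. In the actual model, the fast weights would be decoded into an explicit representation such as Gaussian Splats rather than queried directly.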
