VGG-T$^3$: Offline Feed-Forward 3D Reconstruction at Scale

Sven Elflein; Ruilong Li; Sérgio Agostinho; Zan Gojcic; Laura Leal-Taixé; Qunjie Zhou; Aljosa Osep

TL;DR

A scalable 3D reconstruction model that addresses a critical limitation of offline feed-forward methods: their computational and memory requirements grow quadratically w.r.t. the number of input images. The model retains global scene aggregation capability and outperforms other linear-time methods by large margins.

Abstract

We present a scalable 3D reconstruction model that addresses a critical limitation in offline feed-forward methods: their computational and memory requirements grow quadratically w.r.t. the number of input images. Our approach is built on the key insight that this bottleneck stems from the varying-length Key-Value (KV) space representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training. VGG-T$^3$ (Visual Geometry Grounded Test Time Training) scales linearly w.r.t. the number of input views, similar to online models, and reconstructs a $1k$ image collection in just $54$ seconds, achieving an $11.6\times$ speed-up over baselines that rely on softmax attention. Since our method retains global scene aggregation capability, it outperforms other linear-time methods in point map reconstruction error by large margins. Finally, we demonstrate visual localization capabilities of our model by querying the scene representation with unseen images.
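
To make the idea of distilling a varying-length KV set into a fixed-size MLP concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a plain regression objective (the MLP maps each key to its value), a two-layer tanh network, and vanilla gradient descent; the names `ttt_fit_mlp` and `query_mlp`, the hidden width, the step count, and the learning rate are all hypothetical choices made for illustration.

```python
import numpy as np

def ttt_fit_mlp(keys, values, hidden=32, steps=300, lr=0.02, seed=0):
    """Fit a fixed-size two-layer MLP f(k) ~ v by gradient descent.

    Assumed objective (illustrative only): mean squared error between
    the MLP output for each key and the corresponding value.
    """
    rng = np.random.default_rng(seed)
    n, d_k = keys.shape
    d_v = values.shape[1]
    # Parameter shapes depend only on feature sizes, never on n.
    w1 = rng.normal(0.0, 1.0 / np.sqrt(d_k), (d_k, hidden))
    w2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, d_v))
    losses = []
    for _ in range(steps):
        h = np.tanh(keys @ w1)             # (n, hidden)
        err = h @ w2 - values              # (n, d_v)
        losses.append(float(np.mean(err ** 2)))
        # Gradients of the squared error, up to a constant factor.
        g_w2 = h.T @ err / n
        g_h = (err @ w2.T) * (1.0 - h ** 2)
        g_w1 = keys.T @ g_h / n
        w1 -= lr * g_w1
        w2 -= lr * g_w2
    return (w1, w2), losses

def query_mlp(params, queries):
    """Read from the compressed representation with new query vectors."""
    w1, w2 = params
    return np.tanh(queries @ w1) @ w2

# Toy data standing in for per-token keys and values (hypothetical sizes).
rng = np.random.default_rng(1)
teacher = rng.normal(size=(8, 4)) / np.sqrt(8)
keys = rng.normal(size=(64, 8))
values = np.tanh(keys @ teacher)
params, losses = ttt_fit_mlp(keys, values)
out = query_mlp(params, rng.normal(size=(5, 8)))
```

Each optimizer step touches every token once, so the cost of fitting grows linearly with the number of tokens, while the stored state (`w1`, `w2`) keeps the same size regardless of how many images contributed keys and values.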

Paper Structure (16 sections, 6 equations, 9 figures, 9 tables)

Introduction
Related Work
Feed-Forward 3D Reconstruction at Scale
Preliminaries
Can We Fit Rome into MLPs?
Large-scale Reconstruction
Experiments
Standard Benchmarks
Large-Scale 3D Reconstruction
Feed-forward Visual Localization
Ablations
Conclusion
Implementation Details
VGGT adjustments
Additional Results
...and 1 more section

Figures (9)

Figure 1: VGG-T$^3$ replaces the global attention block in VGGT (left) with a linear-time alternative based on test-time training (right) to compress the KV space into a fixed-size MLP. We use 3 images for visualization purposes, but the approach scales to an arbitrary number of images.
Figure 2: Sequence-length generalization analysis.
Figure 3: Runtime ($\downarrow$) vs. Chamfer distance ($\downarrow$) for collections of size $\in \{ 100, 500, 1k\}$ on the 7scenes dataset. In terms of reconstruction quality (Chamfer distance), we observe a small gap between VGG-T$^3$ and $O(n^2)$ baselines that narrows with an increasing number of images. However, for $1k$ input images, VGGT takes ca. $11$ min while VGG-T$^3$ needs only $58$ seconds ($11.6\times$ speedup). VGG-T$^3$ scales comparably to TTT3R and does not degrade as the number of images increases.
Figure 4: Pointmap error with increasing number of images when varying the optimizer steps on the TTT objective.
Figure 5: Qualitative comparison. From left to right: VGGT, TTT3R, VGG-T$^3$ (Ours).
...and 4 more figures
