ZipMap: Linear-Time Stateful 3D Reconstruction with Test-Time Training

Haian Jin, Rundi Wu, Tianyuan Zhang, Ruiqi Gao, Jonathan T. Barron, Noah Snavely, Aleksander Holynski

TL;DR

ZipMap is a stateful feed-forward model that achieves linear-time, bidirectional 3D reconstruction while matching or surpassing the accuracy of quadratic-time methods. Its stateful representation also enables real-time scene-state querying and extends to sequential streaming reconstruction.

Abstract

Feed-forward transformer models have driven rapid progress in 3D vision, but state-of-the-art methods such as VGGT and $\pi^3$ have a computational cost that scales quadratically with the number of input images, making them inefficient when applied to large image collections. Sequential-reconstruction approaches reduce this cost but sacrifice reconstruction quality. We introduce ZipMap, a stateful feed-forward model that achieves linear-time, bidirectional 3D reconstruction while matching or surpassing the accuracy of quadratic-time methods. ZipMap employs test-time training layers to zip an entire image collection into a compact hidden scene state in a single forward pass, enabling reconstruction of over 700 frames in under 10 seconds on a single H100 GPU, more than $20\times$ faster than state-of-the-art methods such as VGGT. Moreover, we demonstrate the benefits of having a stateful representation in real-time scene-state querying and its extension to sequential streaming reconstruction.
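
To make the idea of zipping a sequence into a fixed-size state more concrete, here is a minimal sketch of a fast-weight memory updated by gradient steps, in the general spirit of test-time training layers. It is not ZipMap's architecture: the linear memory, the squared-error loss, the learning rate, and the names `ttt_update`, `zip_sequence`, and `query` are all assumptions made for illustration.

```python
"""Toy fast-weight memory in the spirit of test-time training (TTT).

Assumption: a linear memory W trained online with a squared-error loss.
This is an illustration only, not the model described in the paper.
"""


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def ttt_update(W, keys, values, lr):
    """One gradient step on sum_i ||W k_i - v_i||^2 over a chunk (in place)."""
    d_out, d_in = len(W), len(W[0])
    grad = [[0.0] * d_in for _ in range(d_out)]
    for k, v in zip(keys, values):
        pred = matvec(W, k)  # prediction with the weights before this step
        for r in range(d_out):
            err = pred[r] - v[r]
            for c in range(d_in):
                grad[r][c] += 2.0 * err * k[c]
    for r in range(d_out):
        for c in range(d_in):
            W[r][c] -= lr * grad[r][c]
    return W


def zip_sequence(chunks, d_in, d_out, lr):
    """Fold a sequence of (keys, values) chunks into one fixed-size state."""
    W = [[0.0] * d_in for _ in range(d_out)]
    for keys, values in chunks:
        ttt_update(W, keys, values, lr)  # same cost for every chunk
    return W


def query(W, q):
    """Read from the state; the cost does not depend on the sequence length."""
    return matvec(W, q)


# Hypothetical data: two chunks, each holding one key/value pair.
chunks = [
    ([[1.0, 0.0]], [[2.0, 3.0]]),
    ([[0.0, 1.0]], [[-1.0, 4.0]]),
]
state = zip_sequence(chunks, d_in=2, d_out=2, lr=0.5)
print(query(state, [1.0, 0.0]))  # prints [2.0, 3.0]
```

Because the state has a fixed size, every chunk costs the same amount of work, so the total cost grows linearly with the number of chunks. Global attention, by contrast, compares every token with every other token.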

Paper Structure

This paper contains 19 sections, 13 equations, 10 figures, and 13 tables.

Figures (10)

  • Figure 1: ZipMap is an efficient feed-forward 3D reconstruction model whose runtime scales linearly with the number of input views while maintaining or exceeding the reconstruction quality of state-of-the-art quadratic-time systems. Left: Given a long input sequence, ZipMap reconstructs image depths, dense 3D point clouds, and camera trajectory in a single forward pass. Right: Compared to quadratic-time models (VGGT and $\pi^3$), ZipMap matches or surpasses their prediction accuracy (lower ATE, top) while scaling linearly in runtime (bottom). At 750 frames, our method runs in under 10 seconds, over $20\times$ faster than VGGT.
  • Figure 2: Method Overview. ZipMap is a stateful feed-forward model with local window attention and large-chunk TTT layers [NIPS2017_att, zhang2025test]. Given $N$ input images, a single linear-time pass predicts camera poses, depth maps, and point maps while storing a compact scene representation in TTT fast weights, which can be queried in real time at novel cameras to synthesize new-view point maps.
  • Figure 3: Example reconstruction results. A sparse subset of the input images is shown on the left, and a visualization of the output 3D reconstructions is shown on the right. Note that our method performs well on challenging cases such as long input sequences, dynamic scenes, and internet photo collections.
  • Figure 4: Long-sequence camera evaluation on DL3DV. We evaluate camera pose accuracy (ATE$\downarrow$) on the DL3DV test set [ling2024dl3dv] under two protocols. Left: increasing scene scale by using the first $N$ frames of each sequence. Right: increasing view density by uniformly subsampling $N$ frames along a fixed trajectory. Our method maintains low error and matches quadratic-time baselines ($\pi^3$, VGGT), while other linear-time methods (CUT3R, TTT3R) degrade significantly as $N$ grows.
  • Figure 5: Querying Unseen Structure. Left: input images (a), GT images at query poses (b), and our predicted depth at those poses (c). Middle: point cloud reconstructed from input images only. Right: point cloud after querying (right column), where the queried point cloud is merged with the input image point cloud. This demonstrates our model's ability to infer common 3D structure (e.g., walls, floors, and ground) in the unseen regions, thereby indicating an understanding of basic 3D scene priors.
  • ...and 5 more figures
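
The two DL3DV protocols described in the Figure 4 caption differ only in which frame indices are selected from a sequence. The following is a minimal sketch of that selection. The function names, the choice to include both trajectory endpoints, and the floor-based rounding are assumptions, since the caption does not specify them.

```python
def first_n_frames(num_frames, n):
    """Protocol 1 (scene scale grows with n): take the first n frames."""
    if not 1 <= n <= num_frames:
        raise ValueError("n must be between 1 and num_frames")
    return list(range(n))


def uniform_subsample(num_frames, n):
    """Protocol 2 (view density grows with n): spread n indices evenly
    over the whole trajectory, keeping both endpoints (an assumption)."""
    if not 1 <= n <= num_frames:
        raise ValueError("n must be between 1 and num_frames")
    if n == 1:
        return [0]
    # Integer arithmetic keeps the indices strictly increasing.
    return [(i * (num_frames - 1)) // (n - 1) for i in range(n)]


# Hypothetical 10-frame sequence, 4 frames selected under each protocol.
print(first_n_frames(10, 4))     # prints [0, 1, 2, 3]
print(uniform_subsample(10, 4))  # prints [0, 3, 6, 9]
```

Under the first protocol the covered portion of the trajectory grows with $N$. Under the second the covered trajectory stays fixed and only the spacing between views shrinks.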