Multi-view Pyramid Transformer: Look Coarser to See Broader

Gyeongjin Kang, Seungkwon Yang, Seungtae Nam, Younggeun Lee, Jungwoo Kim, Eunbyung Park

TL;DR

This work tackles scalable 3D reconstruction from large numbers of input views by introducing the Multi-view Pyramid Transformer (MVP). MVP combines a dual attention hierarchy (inter-view and intra-view) with a pyramidal feature aggregation scheme to produce high-quality reconstructions in a single feed-forward pass while scaling to hundreds of views. In experiments on DL3DV, Tanks&Temples, and Mip-NeRF360, MVP delivers state-of-the-art generalizable reconstruction quality with fast inference: it outpaces prior feed-forward methods and stays close to optimization-based baselines in quality while being far more efficient. The approach establishes a scalable framework for large-scale 3D reconstruction that generalizes across diverse datasets and view configurations, and it leaves room for future dynamic and geometry-supervised extensions.

Abstract

We propose Multi-view Pyramid Transformer (MVP), a scalable multi-view transformer architecture that directly reconstructs large 3D scenes from tens to hundreds of images in a single forward pass. Drawing on the idea of "looking broader to see the whole, looking finer to see the details," MVP is built on two core design principles: 1) a local-to-global inter-view hierarchy that gradually broadens the model's perspective from local views to groups and ultimately the full scene, and 2) a fine-to-coarse intra-view hierarchy that starts from detailed spatial representations and progressively aggregates them into compact, information-dense tokens. This dual hierarchy achieves both computational efficiency and representational richness, enabling fast reconstruction of large and complex scenes. We validate MVP on diverse datasets and show that, when coupled with 3D Gaussian Splatting as the underlying 3D representation, it achieves state-of-the-art generalizable reconstruction quality while maintaining high efficiency and scalability across a wide range of view configurations.
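
To make the efficiency argument concrete, the sketch below counts query-key pairs in self-attention when the view grouping widens while the per-view token count shrinks, and compares that with a single global attention over all full-resolution tokens. It is only a back-of-the-envelope illustration: the function names, the group sizes (1, 8, all views), the pooling factor of 4, and the 1024 tokens per view are assumptions chosen for the example, not values taken from the paper.

```python
def attention_pairs(num_views, tokens_per_view, group_size):
    """Count query-key pairs when self-attention is restricted to groups of views."""
    if num_views % group_size != 0:
        raise ValueError("num_views must be divisible by group_size")
    num_groups = num_views // group_size
    tokens_per_group = group_size * tokens_per_view
    # Attention cost is quadratic in the number of tokens inside each group.
    return num_groups * tokens_per_group ** 2


def hierarchy_pairs(num_views, tokens_per_view, stages):
    """Sum attention pairs over stages given as (group_size, pool_factor).

    group_size=None means one group spanning every view (global attention).
    pool_factor divides the per-view token count before that stage runs.
    Both settings are hypothetical knobs for this illustration.
    """
    total = 0
    tokens = tokens_per_view
    for group_size, pool_factor in stages:
        tokens //= pool_factor  # fine-to-coarse: fewer tokens per view
        size = num_views if group_size is None else group_size
        total += attention_pairs(num_views, tokens, size)  # local-to-global
    return total


# Assumed configuration: 64 views, 1024 tokens per view, three stages.
stages = [(1, 1), (8, 4), (None, 4)]
hierarchical = hierarchy_pairs(64, 1024, stages)
flat_global = attention_pairs(64, 1024, 64)

print(hierarchical)                # 117440512
print(flat_global)                 # 4294967296
print(flat_global / hierarchical)  # about 36.6
```

Under these assumed settings, the widest attention scope is paired with the fewest tokens per view, so no single stage pays the full quadratic cost of attending over every fine-grained token in every view.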

Paper Structure

This paper contains 19 sections, 4 equations, 14 figures, 12 tables.

Figures (14)

  • Figure 1: Our method efficiently processes a wide range of input views, reconstructing diverse large-scale scenes in roughly 0.1 to 2.0 seconds. We utilize Viser [yi2025viser] for 3D scene visualization. Each marker in the plot represents performance under a different number of input views: 16, 32, 64, 128, and 256 views for Ours, and 16, 32, and 64 views for both iLRM and Long-LRM. See the quantitative result tables on DL3DV for further details.
  • Figure 2: Architecture Overview. Given tokenized inputs, our model applies a three stage hierarchy of alternating attention blocks, varying in both self-attention coverage and token resolution. A Pyramidal Feature Aggregation module fuses the outputs from all stages, which are then passed to a final head for dense prediction.
  • Figure 3: Qualitative results on DL3DV (top two rows), Tanks&Temples (third row), and Mip-NeRF360 (bottom row). For a fair and reliable comparison, we evaluate all methods with 32 input views, matching the training setup used for other feed-forward baselines.
  • Figure 4: Qualitative results on the 4-view RE10K dataset.
  • Figure 5: Attention visualization. For colored query patches (red, yellow, green) in the reference view, we highlight top-3 attended tokens: on the left, tokens attended within the group (blue overlay), and on the right, tokens attended within and outside the group (green overlay).
  • ...and 9 more figures