MERG3R: A Divide-and-Conquer Approach to Large-Scale Neural Visual Geometry

Leo Kaixuan Cheng, Abdus Shaikh, Ruofan Liang, Zhijie Wu, Yushi Guan, Nandita Vijaykumar

TL;DR

Across large-scale datasets, including 7-Scenes, NRGBD, Tanks & Temples, and Cambridge Landmarks, MERG3R consistently improves reconstruction accuracy, memory efficiency, and scalability, enabling high-quality reconstruction when the dataset exceeds memory capacity limits.

Abstract

Recent advancements in neural visual geometry, including transformer-based models such as VGGT and Pi3, have achieved impressive accuracy on 3D reconstruction tasks. However, their reliance on full attention makes them fundamentally limited by GPU memory capacity, preventing them from scaling to large, unordered image collections. We introduce MERG3R, a training-free divide-and-conquer framework that enables geometric foundation models to operate far beyond their native memory limits. MERG3R first reorders and partitions unordered images into overlapping, geometrically diverse subsets that can be reconstructed independently. It then merges the resulting local reconstructions through an efficient global alignment and confidence-weighted bundle adjustment procedure, producing a globally consistent 3D model. Our framework is model-agnostic and can be paired with existing neural geometry models. Across large-scale datasets, including 7-Scenes, NRGBD, Tanks & Temples, and Cambridge Landmarks, MERG3R consistently improves reconstruction accuracy, memory efficiency, and scalability, enabling high-quality reconstruction when the dataset exceeds memory capacity limits.
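
To make the merging step concrete, the sketch below shows one standard way to bring two local reconstructions into a common frame: a closed-form similarity (scale, rotation, translation) fit in the style of Umeyama. It is an illustration under assumptions, not the paper's alignment procedure. It assumes the two subsets share some images, so that their pointmaps provide 3D point correspondences. The function name `weighted_similarity` and the use of per-point confidences as weights in this step are hypothetical; the abstract only states that the bundle adjustment is confidence-weighted.

```python
import numpy as np

def weighted_similarity(src, dst, w):
    """Fit s, R, t minimizing sum_i w_i * ||s * R @ src_i + t - dst_i||^2.

    src, dst: (N, 3) corresponding 3D points from two local reconstructions.
    w:        (N,) non-negative weights (assumed here to be confidences).
    """
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                          # normalize weights to sum to 1
    mu_src = w @ src                         # weighted centroids
    mu_dst = w @ dst
    src_c = src - mu_src
    dst_c = dst - mu_dst
    cov = (dst_c * w[:, None]).T @ src_c     # 3x3 weighted cross-covariance
    U, S, Vt = np.linalg.svd(cov)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(U) * np.linalg.det(Vt))
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt
    var_src = np.sum(w * np.sum(src_c ** 2, axis=1))
    s = np.sum(S * np.diag(D)) / var_src     # optimal scale
    t = mu_dst - s * R @ mu_src
    return s, R, t
```

Applying `s * points @ R.T + t` to every point of the source reconstruction maps it into the destination frame. A closed-form fit like this would typically serve only as an initialization, with a joint refinement such as the bundle adjustment mentioned above correcting the remaining drift.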

Paper Structure

This paper contains 27 sections, 9 equations, 16 figures, and 11 tables.

Figures (16)

  • Figure 1: Given a large unordered set of 1,000 input images, MERG3R reconstructs accurate camera poses and a high-quality point cloud. Even with challenging viewpoints and an image sequence so long that it may not fit in device memory, our pipeline enables scalable and reliable geometry reconstruction. Project page: https://leochengkx.github.io/MERG3R/
  • Figure 2: Overview of our large-scale 3D reconstruction pipeline. Given an unordered set of images, we first sort them into a pseudo-video sequence, then split the sequence into multiple interleaved subsets. Each subset is independently processed by a geometric foundation model to produce local pointmaps and poses. The resulting clusters are aligned into a common reference frame and jointly refined via global bundle adjustment, producing a coherent final reconstruction.
  • Figure 3: Illustration of the image partitioning process. Given an unordered set of $n$ images, we first compute the visual-similarity matrix $\mathbf{M} \in \mathbb{R}^{n \times n}$ and use it to search for a Hamiltonian path (shown in red) to produce the pseudo-video sequence. We then reorder the images using interleaved sampling and divide them into overlapping clusters.
  • Figure 4: Qualitative comparison of a predicted camera trajectory on a 300-image sequence from the Cambridge Landmarks dataset. Each subplot shows the estimated camera poses (red) overlaid with ground truth trajectories (green). Our method produces accurate and consistent trajectories, effectively handling long sequences with hundreds of frames.
  • Figure 5: Qualitative comparison of 3D reconstructions on short (approx. 300–500 images) and long (approx. 1,000 images) sequences. Our method (Ours + Pi-3) produces sharper and more complete point clouds than CUT3R, TTT3R, and $\pi^3$. Competing methods fail or run out of memory (OOM) on long sequences, while ours remains stable.
  • ...and 11 more figures
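
The partitioning described in the captions of Figures 2 and 3 (similarity matrix, Hamiltonian path, interleaved sampling, overlapping clusters) can be sketched roughly as follows. This is a minimal illustration under assumptions rather than the authors' implementation: the captions do not say how the Hamiltonian path is searched or how the overlap is formed, so the sketch substitutes a greedy nearest-neighbour heuristic for the path and borrows evenly spaced frames from the next subset to create overlap. The function names and the `num_clusters` and `overlap` parameters are hypothetical.

```python
import numpy as np

def greedy_path(sim):
    """Order images into a pseudo-video sequence.

    Greedy stand-in for the Hamiltonian-path search: start at image 0 and
    repeatedly jump to the most similar image not yet visited.
    sim: (n, n) visual-similarity matrix, higher means more similar.
    """
    n = sim.shape[0]
    visited = np.zeros(n, dtype=bool)
    path = [0]
    visited[0] = True
    for _ in range(n - 1):
        scores = np.where(visited, -np.inf, sim[path[-1]])
        nxt = int(np.argmax(scores))
        path.append(nxt)
        visited[nxt] = True
    return path

def interleaved_clusters(path, num_clusters, overlap):
    """Split the ordered sequence into interleaved, overlapping subsets.

    Subset c takes every num_clusters-th frame starting at offset c, so each
    subset spans the whole sequence. It then borrows up to `overlap` evenly
    spaced frames from the next subset (an assumed overlap scheme).
    """
    clusters = []
    for c in range(num_clusters):
        own = list(path[c::num_clusters])
        nxt = list(path[(c + 1) % num_clusters::num_clusters])
        if overlap > 0 and nxt:
            step = max(1, len(nxt) // overlap)
            shared = nxt[::step][:overlap]
        else:
            shared = []
        clusters.append(own + [i for i in shared if i not in own])
    return clusters
```

For example, `interleaved_clusters(greedy_path(sim), num_clusters=4, overlap=8)` would yield four subsets that each cover the full extent of the scene. Interleaving, as opposed to cutting the sequence into contiguous chunks, is what makes each subset geometrically diverse, and the shared frames are what a later alignment step can use to relate the subsets to one another.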