Multi-View Large Reconstruction Model via Geometry-Aware Positional Encoding and Attention

Mengfei Li; Xiaoxiao Long; Yixun Liang; Weiyu Li; Yuan Liu; Peng Li; Wenhan Luo; Wenping Wang; Yike Guo

TL;DR

This work tackles the inefficiencies of extending Large Reconstruction Models to multi-view inputs by introducing M-LRM, a 3D-aware transformer framework. It integrates geometry-aware positional embeddings (GaPE) and geometry-aware cross attention (GCA) to inject explicit 3D priors and enforce cross-view coherence, initializing triplane tokens with geometry-informed priors. The approach yields higher-fidelity 3D geometries, faster convergence, and improved robustness over prior methods such as Instant3D and LGM, demonstrated on multi-view and single-view generation tasks. This advancement enables more reliable 3D reconstruction from sparse view sets and enhances practical applicability in 3D content generation and related domains.

Abstract

Although the Large Reconstruction Model (LRM) has demonstrated impressive results, extending its input from a single image to multiple images leads to inefficiencies, subpar geometric and texture quality, and slower convergence than expected. This is because LRM formulates 3D reconstruction as a naive images-to-3D translation problem, ignoring the strong 3D coherence among the input images. In this paper, we propose a Multi-view Large Reconstruction Model (M-LRM) designed to reconstruct high-quality 3D shapes from multiple views in a 3D-aware manner. Specifically, we introduce a multi-view consistent cross-attention scheme that enables M-LRM to accurately query information from the input images. Moreover, we use the 3D priors of the input multi-view images to initialize the triplane tokens. Compared to previous methods, the proposed M-LRM generates 3D shapes of higher fidelity. Experimental studies demonstrate that our model achieves a significant performance gain and faster training convergence. Project page: https://murphylmf.github.io/M-LRM/

Paper Structure (22 sections, 3 equations, 9 figures, 6 tables)

Introduction
Related Work
3D Generation
Sparse-View Reconstruction
Large Reconstruction Model
M-LRM
Geometry-aware Positional Embeddings
Geometry-aware Cross Attention
Training Objectives
Implementation details
Data Preparation
Training Scheme
Network Architecture
Experiments
Multi-view Reconstruction
...and 7 more sections

Figures (9)

Figure 1: Our M-LRM is a novel framework that handles both single-view generation tasks and multi-view (e.g. 6-view) reconstruction tasks. M-LRM produces high-quality novel views given a single view or multiple views as input.
Figure 2: Overview of M-LRM. M-LRM is a fully differentiable transformer-based framework, featuring an image encoder, geometry-aware position encoding, and a multi-view cross attention block. Given multi-view images with corresponding camera extrinsics $E$ and intrinsics $K$, M-LRM combines the 2D and 3D features to conduct 3D-aware multi-view attention. Our transformer triplane decoder consists of $L$ layers, where each layer is composed of a multi-view cross attention block and a vanilla transformer block. The proposed geometry-aware position encoding allows more detailed and realistic 3D generation.
Figure 3: Pipeline of Geometry-aware Positional Embedding (GaPE) and Geometry-aware Cross Attention (GCA). GaPE (top) fuses the rich priors of the input views carried by the grid features with randomly initialized triplane tokens, making the triplane tokens geometry-aware. GCA (bottom) uses triplane tokens as queries and finds the spatially corresponding tokens in each feature grid as keys and values, incorporating the 3D priors and interrelations of the multiple input views.
Figure 4: Illustration of Geometry-aware Cross Attention. The vertices of the geometry volume are projected back onto the 2D images. The features at the projected 2D locations are sampled as keys and values in GCA.
Figure 5: Qualitative results of sparse-view reconstruction. We compare our generated results with two baseline methods. Given the same input conditional views, our tiny model generates results of higher fidelity with more plausible geometry. Our base variant uses two additional views as input (the rightmost two images of the input views).
...and 4 more figures
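
The projection-and-sampling idea in the Figure 4 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the world-to-camera convention for $E$ (a 3x4 matrix), pixel-unit intrinsics $K$, bilinear sampling with border clamping, and the omission of learned query/key/value projections are all assumptions made for illustration. Points behind the camera are not handled.

```python
import numpy as np

def project_points(points_world, K, E):
    """Project (N, 3) world points to pixel coordinates.
    Assumes E is a 3x4 world-to-camera matrix and K a 3x3 pinhole intrinsic matrix."""
    ones = np.ones((points_world.shape[0], 1))
    homo = np.concatenate([points_world, ones], axis=1)  # (N, 4)
    cam = homo @ E.T                                     # (N, 3) camera-space points
    pix = cam @ K.T                                      # (N, 3) homogeneous pixels
    depth = pix[:, 2:3]
    return pix[:, :2] / depth, depth[:, 0]

def bilinear_sample(feat, uv):
    """Sample a (H, W, C) feature map at (N, 2) pixel coordinates (x, y).
    Coordinates are clamped to the image border (an assumption of this sketch)."""
    H, W, _ = feat.shape
    x = np.clip(uv[:, 0], 0, W - 1)
    y = np.clip(uv[:, 1], 0, H - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bottom = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bottom * wy

def projected_cross_attention(queries, points_world, feats, Ks, Es):
    """Each query token (N, C) is tied to one 3D point and attends only to the
    V image features sampled at that point's projections (one per view)."""
    sampled = np.stack(
        [bilinear_sample(f, project_points(points_world, K, E)[0])
         for f, K, E in zip(feats, Ks, Es)],
        axis=1,
    )                                                    # (N, V, C) keys and values
    scores = np.einsum("nc,nvc->nv", queries, sampled) / np.sqrt(queries.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return np.einsum("nv,nvc->nc", weights, sampled)     # (N, C)
```

Compared with dense cross attention over every image token, restricting each query to one sampled feature per view keeps the attention cost linear in the number of views, which is one plausible reading of why a projection-based scheme would be cheaper; the source captions do not state the cost explicitly.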
