NOVA3R: Non-pixel-aligned Visual Transformer for Amodal 3D Reconstruction

Weirong Chen; Chuanxia Zheng; Ganlin Zhang; Andrea Vedaldi; Daniel Cremers

TL;DR

This work introduces a scene-token mechanism that aggregates information across unposed images and a diffusion-based 3D decoder that reconstructs complete, non-pixel-aligned point clouds. The resulting model, NOVA3R, outperforms state-of-the-art methods in reconstruction accuracy and completeness.

Abstract

We present NOVA3R, an effective approach for non-pixel-aligned 3D reconstruction from a set of unposed images in a feed-forward manner. Unlike pixel-aligned methods that tie geometry to per-ray predictions, our formulation learns a global, view-agnostic scene representation that decouples reconstruction from pixel alignment. This addresses two key limitations in pixel-aligned 3D: (1) it recovers both visible and invisible points with a complete scene representation, and (2) it produces physically plausible geometry with fewer duplicated structures in overlapping regions. To achieve this, we introduce a scene-token mechanism that aggregates information across unposed images and a diffusion-based 3D decoder that reconstructs complete, non-pixel-aligned point clouds. Extensive experiments on both scene-level and object-level datasets demonstrate that NOVA3R outperforms state-of-the-art methods in terms of reconstruction accuracy and completeness.
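
To make the decoder idea concrete, the following is a minimal sketch of a flow-matching objective and sampler for point clouds. It is an illustration under stated assumptions, not the authors' implementation: it assumes a linear (rectified-flow style) noise-to-data path, per-point time sampling, and a plain Euler sampler, none of which are specified in the text above. The names `flow_matching_loss`, `sample_points`, `velocity_fn`, and `cond` are hypothetical; `cond` stands in for whatever scene latent conditions the decoder.

```python
import numpy as np

def flow_matching_loss(velocity_fn, x1, cond, rng):
    """Monte Carlo estimate of a conditional flow-matching loss.

    x1:   (N, 3) target points of a complete point cloud.
    cond: conditioning passed through to velocity_fn (hypothetical stand-in
          for scene tokens).
    Assumes a linear path x_t = (1 - t) * x0 + t * x1 between noise and data.
    """
    x0 = rng.standard_normal(x1.shape)        # Gaussian noise, one sample per point
    t = rng.uniform(size=(x1.shape[0], 1))    # per-point time in [0, 1)
    xt = (1.0 - t) * x0 + t * x1              # point on the linear path
    target = x1 - x0                          # velocity of the linear path
    pred = velocity_fn(xt, t, cond)           # model's predicted velocity
    return float(np.mean(np.sum((pred - target) ** 2, axis=-1)))

def sample_points(velocity_fn, cond, num_points, rng, steps=50):
    """Integrate dx/dt = velocity_fn(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = rng.standard_normal((num_points, 3))  # start from noise, not from pixels
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full((num_points, 1), i * dt)
        x = x + dt * velocity_fn(x, t, cond)
    return x
```

Because sampling starts from noise rather than from image rays, the number of output points is a free parameter and the points are not tied to pixels, which is the sense in which such a decoder is non-pixel-aligned. In practice `velocity_fn` would be a learned network; any callable with the signature `(xt, t, cond)` works for experimenting with the sketch.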

Paper Structure (48 sections, 2 equations, 13 figures, 6 tables)

Introduction
Related Work
Feed-Forward 3D Reconstruction.
Complete 3D Reconstruction.
Method
Problem Formulation
Problem Definition.
Data Preprocessing.
3D Latent Encoder-Decoder with Flow Matching
Diffusion-based 3D AutoEncoder.
Architecture.
Scene Representation with Learnable Tokens
Learnable Scene Tokens.
Architecture.
Experiments
...and 33 more sections

Figures (13)

Figure 1: NOVA3R enables non-pixel-aligned reconstruction by learning a global scene representation from unposed images. Compared to pixel-aligned methods, NOVA3R recovers both visible and occluded regions and produces more physically plausible geometry with fewer duplicated structures.
Figure 2: Comparison of different reconstruction paradigms. Our non-pixel-aligned approach combines feed-forward efficiency with a global, view-agnostic scene representation, removing the reliance on pixel-level supervision. NOVA3R provides a unified solution for various reconstruction tasks, achieving multi-view consistency and geometrically faithful results.
Figure 3: Overview of NOVA3R. Stage 1: a 3D point autoencoder encodes complete point clouds into latent scene tokens and decodes them with a flow-matching (FM) decoder. Stage 2: an image encoder with learnable scene tokens integrates multi-view information into a unified scene latent space, supervised by the FM loss with the Stage-1 decoder frozen. During inference, only the second-stage pipeline is used to produce a complete, non-pixel-aligned point cloud.
Figure 4: Visible point clouds vs. complete point clouds. NOVA3R aims to recover the complete geometry within the input view's frustum.
Figure 5: Different decoder architectures. The independent decoder uses cross-attention only, while the joint decoder adds an efficient self-attention, which yields more precise structures.
...and 8 more figures
