PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction

Xiang Zhang; Sohyun Yoo; Hongrui Wu; Chuan Li; Jianwen Xie; Zhuowen Tu

TL;DR

PixARMesh achieves state-of-the-art reconstruction quality while producing lightweight, high-quality meshes ready for downstream applications. It augments a point-cloud encoder with pixel-aligned image features and global scene context via cross-attention, enabling accurate spatial reasoning from a single image.

Abstract

We introduce PixARMesh, a method to autoregressively reconstruct complete 3D indoor scene meshes directly from a single RGB image. Unlike prior methods that rely on implicit signed distance fields and post-hoc layout optimization, PixARMesh jointly predicts object layout and geometry within a unified model, producing coherent and artist-ready meshes in a single forward pass. Building on recent advances in mesh generative models, we augment a point-cloud encoder with pixel-aligned image features and global scene context via cross-attention, enabling accurate spatial reasoning from a single image. Scenes are generated autoregressively from a unified token stream containing context, pose, and mesh, yielding compact meshes with high-fidelity geometry. Experiments on synthetic and real-world datasets show that PixARMesh achieves state-of-the-art reconstruction quality while producing lightweight, high-quality meshes ready for downstream applications.
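
To make the phrase "pixel-aligned image features" concrete, here is a minimal sketch of one common way to attach image features to 3D points: project each camera-space point through a pinhole model and read the feature vector at the nearest pixel. This is an illustration only, not the paper's implementation. The function names, the nearest-neighbor lookup, the border clamping, and the intrinsics `fx`, `fy`, `cx`, `cy` are all assumptions introduced here.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def project_point(p: Point, fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float]:
    """Pinhole projection of a camera-space point (x, y, z) to pixel coordinates (u, v)."""
    x, y, z = p
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return fx * x / z + cx, fy * y / z + cy


def sample_pixel_aligned(
    points: Sequence[Point],
    feat_map: Sequence[Sequence[Sequence[float]]],  # feat_map[row][col] -> feature vector
    fx: float, fy: float, cx: float, cy: float,
) -> List[List[float]]:
    """Concatenate each point's xyz with the image feature at its projected pixel.

    Hypothetical helper: nearest-neighbor lookup with clamping at the image border.
    """
    height, width = len(feat_map), len(feat_map[0])
    fused = []
    for p in points:
        u, v = project_point(p, fx, fy, cx, cy)
        # Round to the nearest pixel center, then clamp into the feature map.
        col = min(max(int(math.floor(u + 0.5)), 0), width - 1)
        row = min(max(int(math.floor(v + 0.5)), 0), height - 1)
        fused.append(list(p) + list(feat_map[row][col]))
    return fused


if __name__ == "__main__":
    # Toy 2x2 feature map with one channel per pixel.
    fmap = [[[1.0], [2.0]], [[3.0], [4.0]]]
    pts = [(1.0, 0.0, 1.0), (0.0, 2.0, 2.0)]
    print(sample_pixel_aligned(pts, fmap, fx=1.0, fy=1.0, cx=0.0, cy=0.0))
```

In practice, bilinear interpolation over a learned feature map is a frequent alternative to the nearest-neighbor lookup used in this sketch.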

Paper Structure (38 sections, 6 equations, 5 figures, 9 tables)

This paper contains 38 sections, 6 equations, 5 figures, 9 tables.

Introduction
Related Work
3D Scene Reconstruction from a Single Image
Native Mesh Generation
Method
Preliminary
Repurposing the Point-Cloud Encoder
Injecting Pixel-Aligned Image Features
Scene Context Aggregation
Tokenization
Object Pose Tokenization
Object Mesh Tokenization
Final Token Sequence
Training
Experiments
...and 23 more sections

Figures (5)

Figure 1: Comparison of PixARMesh with recent compositional scene reconstruction methods. PixARMesh predicts object poses and reconstructs native meshes in a single autoregressive decoding process, without relying on SDF-based surface extraction or layout optimization, producing compact and artist-ready mesh outputs.
Figure 2: Pipeline overview. Given an RGB image, we use pretrained models to extract the depth point cloud and image features for both the target object $i$ and the global scene. These local and global cues are fed into the Pixel-Aligned PC-Encoder to produce the fused latent code, which is then aggregated into a single latent vector via cross-attention. This latent vector conditions the Transformer Decoder, which predicts the object's pose followed by its mesh token sequence.
Figure 3: Qualitative comparisons on the 3D-FRONT [3DFRONT] dataset. For PixARMesh, we also show the mesh wireframe to highlight geometric quality.
Figure 4: Qualitative results on real images from the Pix3D [sun2018pix3d], Matterport3D [chang2017matterport3d], and ScanNet [dai2017scannet] datasets.
Figure C.1: Additional qualitative results on real images from the Pix3D [sun2018pix3d], Matterport3D [chang2017matterport3d], and ScanNet [dai2017scannet] datasets.
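
The Figure 2 caption describes aggregating the fused latent code into a single latent vector via cross-attention. The sketch below shows the generic form of that operation: one query vector attends over a set of key/value tokens with scaled dot-product attention and returns their weighted average. It is a hedged illustration, not the paper's module. The single-head form, the absence of learned projections, and the names `cross_attention_pool`, `query`, `keys`, and `values` are assumptions made here.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_pool(
    query: Sequence[float],
    keys: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
) -> List[float]:
    """Pool N tokens into one vector with single-query scaled dot-product attention.

    Hypothetical helper: a real model would typically apply learned projections
    to the query, keys, and values and use several attention heads.
    """
    d = len(query)
    # One attention score per token: <query, key> / sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Weighted average of the value vectors.
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0]]
    # A zero query scores every token equally, so the output is the plain mean.
    print(cross_attention_pool([0.0, 0.0], tokens, tokens))
```

Because the output has a fixed size regardless of how many tokens are pooled, a vector of this kind can condition a decoder on point sets of varying size.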
