UniQueR: Unified Query-based Feedforward 3D Reconstruction

Chensheng Peng, Quentin Herau, Jiezhi Yang, Yichen Xie, Yihan Hu, Wenzhao Zheng, Matthew Strong, Masayoshi Tomizuka, Wei Zhan

Abstract

We present UniQueR, a unified query-based feedforward framework for efficient and accurate 3D reconstruction from unposed images. Existing feedforward models such as DUSt3R, VGGT, and AnySplat typically predict per-pixel point maps or pixel-aligned Gaussians, which remain fundamentally 2.5D and limited to visible surfaces. In contrast, UniQueR formulates reconstruction as a sparse 3D query inference problem. Our model learns a compact set of 3D anchor points that act as explicit geometric queries, enabling the network to infer scene structure, including geometry in occluded regions, in a single forward pass. Each query encodes spatial and appearance priors directly in global 3D space (instead of per-frame camera space) and spawns a set of 3D Gaussians for differentiable rendering. By leveraging unified query interactions across multi-view features and a decoupled cross-attention design, UniQueR achieves strong geometric expressiveness while substantially reducing memory and computational cost. Experiments on Mip-NeRF 360 and VR-NeRF demonstrate that UniQueR surpasses state-of-the-art feedforward methods in both rendering quality and geometric accuracy, using an order of magnitude fewer primitives than dense alternatives.
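
To make the query-to-Gaussian step concrete, the following is a minimal sketch of how a sparse set of global-space queries could each spawn several Gaussians. It is an illustration under stated assumptions, not the authors' implementation: the names `Query`, `Gaussian`, and `spawn_gaussians` are hypothetical, the "anchor plus offset" parameterization of Gaussian centers is assumed rather than taken from the paper, and the offsets are hand-written placeholders for what a network would predict. Covariance, opacity, and color are omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Query:
    # Hypothetical container: a 3D anchor in the global (world) frame plus
    # K offsets that stand in for network-predicted values.
    anchor: Vec3
    offsets: List[Vec3]


@dataclass
class Gaussian:
    # Only the mean is modeled here; a real splat also carries
    # covariance, opacity, and appearance parameters.
    center: Vec3


def spawn_gaussians(queries: List[Query]) -> List[Gaussian]:
    """Spawn one Gaussian per (query, offset) pair, centered at anchor + offset."""
    gaussians = []
    for q in queries:
        ax, ay, az = q.anchor
        for dx, dy, dz in q.offsets:
            gaussians.append(Gaussian(center=(ax + dx, ay + dy, az + dz)))
    return gaussians


# Two queries, each spawning K = 2 Gaussians (placeholder numbers).
queries = [
    Query(anchor=(0.0, 0.0, 1.0), offsets=[(0.5, 0.0, 0.0), (0.0, 0.5, 0.0)]),
    Query(anchor=(1.0, 2.0, 3.0), offsets=[(0.0, 0.0, 0.5), (-1.0, 0.0, 0.0)]),
]
gaussians = spawn_gaussians(queries)
print(len(gaussians))  # number of primitives = num_queries * K
```

Under this sketch the primitive count is the number of queries times the Gaussians per query, independent of image resolution, whereas a pixel-aligned representation grows with the number of input pixels.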

Paper Structure (21 sections, 9 equations, 7 figures, 7 tables, 1 algorithm)

Figures (7)

  • Figure 1: Illustration of pixel-aligned and query-based pipelines. Given input images, UniQueR predicts a sparse set of 3D queries that spawn Gaussians covering both observed surfaces and occluded regions in global space. Unlike pixel-aligned methods (e.g., AnySplat) that produce holes in unobserved areas, our query-based representation enables more complete 3D reconstruction with accurate geometry.
  • Figure 2: UniQueR pipeline overview. Given multi-view images, a ViT encoder with alternating attention extracts per-frame tokens and decodes camera poses and point maps. A set of 3D queries is refined through cross-attention with image tokens and self-attention among queries. Each query then spawns $K$ Gaussians, which are rendered via differentiable splatting for RGB and depth supervision.
  • Figure 3: Comparison of attention designs. We contrast full self-attention over concatenated image and query tokens with our decoupled design using cross-attention followed by inter-query self-attention.
  • Figure 4: Qualitative and geometric comparison. Top rows: rendered RGB on held-out novel views. Bottom rows: rendered depth maps. AnySplat produces blank regions in RGB and holes in depth where no input pixels provide coverage, due to its pixel-aligned representation. UniQueR fills in these occluded areas through 3D queries, yielding more complete geometry and fewer rendering artifacts.
  • Figure 5: Ablation studies on (a) the number of queries, (b) the number of Gaussians per query, and (c) model capacity. Increasing any of these factors leads to consistent improvements in PSNR, demonstrating clear scaling behavior in UniQueR.
  • ...and 2 more figures
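
The contrast drawn in the Figure 3 caption can be illustrated with a rough count of pairwise attention scores per layer. This is a back-of-the-envelope sketch, not a measurement from the paper: the function names are hypothetical, the token counts are made-up illustrative values, and only the query-refinement attention is counted (the image encoder's own attention is left out).

```python
def full_self_attention_pairs(num_image_tokens: int, num_queries: int) -> int:
    """Score count when image and query tokens are concatenated into one sequence."""
    total = num_image_tokens + num_queries
    return total * total


def decoupled_attention_pairs(num_image_tokens: int, num_queries: int) -> int:
    """Score count for query-to-image cross-attention plus inter-query self-attention."""
    cross = num_queries * num_image_tokens   # queries attend to image tokens
    inter_query = num_queries * num_queries  # queries attend to each other
    return cross + inter_query


# Illustrative sizes only (assumed, not reported in the paper).
n_img, n_q = 10_000, 1_000
full = full_self_attention_pairs(n_img, n_q)
decoupled = decoupled_attention_pairs(n_img, n_q)
print(full, decoupled, full / decoupled)  # 121000000 11000000 11.0
```

With these assumed sizes, the decoupled design evaluates roughly 11 times fewer attention scores, because it avoids the quadratic image-to-image term that dominates the concatenated sequence.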