RnG: A Unified Transformer for Complete 3D Modeling from Partial Observations

Mochu Xiang, Zhelun Shen, Xuesong Li, Jiahui Ren, Jing Zhang, Chen Zhao, Shanshan Liu, Haocheng Feng, Jingdong Wang, Yuchao Dai

TL;DR

This work presents RnG (Reconstruction and Generation), a novel feed-forward Transformer that unifies 3D reconstruction and generation by predicting an implicit, complete 3D representation. It separates the two tasks at the attention level and treats the KV-cache as that implicit 3D representation.

Abstract

Humans perceive the 3D world through 2D observations from limited viewpoints. While recent feed-forward generalizable 3D reconstruction models excel at recovering 3D structures from sparse images, their representations are often confined to observed regions, leaving unseen geometry unmodeled. This raises a fundamental challenge: can we infer a complete 3D structure from partial 2D observations? We present RnG (Reconstruction and Generation), a novel feed-forward Transformer that unifies reconstruction and generation by predicting an implicit, complete 3D representation. At the core of RnG, we propose a reconstruction-guided causal attention mechanism that separates reconstruction and generation at the attention level and treats the KV-cache as an implicit 3D representation. Arbitrary poses can then efficiently query this cache to render high-fidelity, novel-view RGBD outputs. As a result, RnG not only accurately reconstructs visible geometry but also generates plausible, coherent unseen geometry and appearance. Our method achieves state-of-the-art performance in both generalizable 3D reconstruction and novel view generation, while operating efficiently enough for real-time interactive applications. Project page: https://npucvr.github.io/RnG
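The idea of treating the KV-cache as an implicit 3D representation can be illustrated with a toy, single-layer sketch. Everything here is an assumption made for illustration: the mask rule (every query sees only source-view keys), the identity Q/K/V projections, and the names `attend`, `source`, `target`, and `kv_cache` are hypothetical and are not taken from the paper. The sketch only shows why such a mask would let source-view keys and values be computed once and then queried by target-view tokens.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(queries, keys, values, allowed=None):
    """Single-head scaled dot-product attention.

    allowed[i][j] == False hides key j from query i (None means no mask).
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for i, q in enumerate(queries):
        idx = [j for j in range(len(keys)) if allowed is None or allowed[i][j]]
        scores = [scale * sum(a * b for a, b in zip(q, keys[j])) for j in idx]
        weights = softmax(scores)
        out.append([sum(w * values[j][d] for w, j in zip(weights, idx))
                    for d in range(len(values[0]))])
    return out

# Toy tokens; identity projections are assumed, so tokens act as Q, K and V.
source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # source-view tokens
target = [[0.5, -0.5], [2.0, 0.0]]              # target-view (ray) tokens
tokens = source + target
n_src = len(source)

# Training-style pass: one masked attention over all tokens.
# Assumed mask: every query may attend to source-view keys only.
allowed = [[j < n_src for j in range(len(tokens))] for _ in tokens]
full = attend(tokens, tokens, tokens, allowed)

# Inference-style pass: cache source K/V once, then query with target tokens.
kv_cache = (source, source)
rendered = attend(target, kv_cache[0], kv_cache[1])

# Source outputs never depend on the target tokens under this mask.
source_only = attend(source, source, source)
```

Under this assumed mask, `rendered` matches the target rows of `full`, and `source_only` matches the source rows, so new target poses can reuse the same cache without recomputing the source views.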

Paper Structure (27 sections, 10 equations, 15 figures, 5 tables)

This paper contains 27 sections, 10 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: What can RnG do? Given a few unposed images of an object, 3D reconstruction foundation models like VGGT can recover the structure of observed regions but leave the unseen parts unmodeled. RnG can estimate the object's complete 3D geometry within a second on an A800 GPU, using a single feed-forward transformer. RnG implicitly reconstructs the 3D object and renders it from new viewpoints with both appearance and geometry. By accumulating these rendered point maps, RnG can generate a complete 3D object, working like a virtual 3D scanner.
  • Figure 2: The Network Architecture of RnG. (a) Source-view images are first tokenized using the DINO vision transformer; the Plücker ray map representing the target viewpoint goes through a linear layer. After camera tokens are added for each view, all tokens pass alternately through global- and frame-level attention blocks. Finally, camera tokens from the input views are used to estimate camera poses, while a point head and an RGB head process ray tokens from the target view, providing geometry and appearance estimates. (b) At inference, the model can cache K/V tokens from the source views, synthesizing novel-view geometry and appearance at a higher speed.
  • Figure 3: The reconstruction-guided causal attention. (a) During training, we decouple reconstruction and generation at the attention level inside global attention blocks. At inference time, the attention process is split into two steps: (b) source-view key/value tokens are cached as an implicit 3D representation; (c) the KV-cache is queried by target-view poses to generate novel views.
  • Figure 4: Visual comparison of novel view synthesis. Though RnG does not require accurate poses as input, it provides visual quality comparable to state-of-the-art pose-dependent methods like LVSM. Our model can hallucinate unseen regions with high 3D consistency.
  • Figure 5: Camera pose and point cloud visualization. Reconstructions are normalized to match the ground-truth (GT) scale and are aligned to the first frame's position (dark blue). The camera poses estimated by RnG align closely with the ground truth. Our point cloud back-projected from source views does not suffer from layering artifacts and presents accurate object structures.
  • ...and 10 more figures
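
The Figure 2 caption mentions a Plücker ray map representing the target viewpoint. As a rough illustration of what such a map can look like, the sketch below builds one for a pinhole camera, storing for each pixel the unit ray direction d and the moment o × d, where o is the camera origin. The function name `plucker_ray_map`, the (direction, moment) ordering, the pixel-center offset of 0.5, and the toy intrinsics are all assumptions for illustration; the paper's exact convention is not stated in this summary.

```python
import math

def cross(a, b):
    # 3D cross product a x b.
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def plucker_ray_map(height, width, fx, fy, cx, cy, rotation, origin):
    """Return an H x W grid of 6-vectors [d, o x d] in world coordinates.

    rotation is a 3x3 camera-to-world matrix (list of rows) and origin is
    the camera center in world coordinates (assumed pinhole model).
    """
    ray_map = []
    for v in range(height):
        row = []
        for u in range(width):
            # Ray through the pixel center, expressed in the camera frame.
            d_cam = [(u + 0.5 - cx) / fx, (v + 0.5 - cy) / fy, 1.0]
            # Rotate into the world frame and normalize to unit length.
            d_world = [sum(rotation[r][c] * d_cam[c] for c in range(3))
                       for r in range(3)]
            norm = math.sqrt(sum(x * x for x in d_world))
            d = [x / norm for x in d_world]
            row.append(d + cross(origin, d))
        ray_map.append(row)
    return ray_map

# Toy usage: a 4x4 target view, identity rotation, camera 2 units behind the origin.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ray_map_example = plucker_ray_map(4, 4, 4.0, 4.0, 2.0, 2.0,
                                  identity, [0.0, 0.0, -2.0])
```

Each 6-vector depends only on the camera pose and intrinsics, not on image content, which is why a representation of this kind can describe a target view for which no image exists.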