unPIC: A Geometric Multiview Prior for Image to 3D Synthesis
Rishabh Kabra, Drew A. Hudson, Sjoerd van Steenkiste, Joao Carreira, Niloy J. Mitra
TL;DR
unPIC addresses the underspecified problem of generating multiple plausible 3D-consistent views from a single image by factorizing the task into p(geometry | image) and p(appearance | geometry, image) within a two-stage diffusion framework. The geometry is encoded as CROCS, a camera-relative, NOCS-inspired representation that provides cross-view correspondence and keeps the geometry predictable across arbitrary source poses. Empirically, this geometry-grounded, hierarchical approach yields better shape and multiview consistency than geometry-free baselines on ObjaverseXL and on unseen real-world datasets, while remaining robust to out-of-distribution inputs. The work argues that explicit geometric supervision, together with decoupling geometry from appearance, substantially improves generalization in image-to-3D synthesis.
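The factorization above can be written out as a short worked equation. The symbols are assumed notation for illustration, not taken from the paper: x denotes the source image, g the multiview geometry, and y the appearance of the target views. Under that notation, the two stages amount to ancestral sampling from a marginal over geometry:

```latex
\begin{align}
p(y \mid x) &= \int p(y \mid g, x)\, p(g \mid x)\, \mathrm{d}g, \\
g &\sim p(g \mid x) \quad \text{(stage 1: sample geometry given the image)}, \\
y &\sim p(y \mid g, x) \quad \text{(stage 2: sample appearance given geometry and image)}.
\end{align}
```

Drawing g first and then y conditioned on it yields a sample from p(y | x) without ever evaluating the integral, which is presumably why the problem's ambiguity can be handled at the geometry stage before any appearance is generated.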
Abstract
We introduce a hierarchical probabilistic approach to go from a 2D image to multiview 3D: a diffusion "prior" predicts the unseen 3D geometry, which then conditions a diffusion "decoder" to generate novel views of the subject. We use a pointmap-based geometric representation to coordinate the generation of multiple target views simultaneously. We construct a predictable distribution of geometric features per target view to enable learnability across examples, and generalization to arbitrary input images. Our modular, geometry-driven approach to novel-view synthesis (called "unPIC") beats competing baselines such as CAT3D, EscherNet, Free3D, and One-2-3-45 on held-out objects from ObjaverseXL, as well as unseen real-world objects from Google Scanned Objects, Amazon Berkeley Objects, and the Digital Twin Catalog.
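To make the idea of a pointmap-based, camera-relative geometric representation more concrete, here is a minimal sketch in plain Python. It is not the paper's CROCS definition. It only illustrates the general NOCS-style recipe of normalizing surface points into a unit cube and reading the coordinates as colors, with an assumed extra step that rotates the coordinates into the source camera's frame. The function names (`normalize_points`, `rotate`, `camera_relative_colors`) and the `source_rotation` parameter are hypothetical.

```python
import math

def normalize_points(points):
    """Center points on their bounding-box midpoint and scale so the
    bounding-box diagonal has length 1 (NOCS-style normalization)."""
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    center = [(lo + hi) / 2 for lo, hi in zip(mins, maxs)]
    diagonal = math.sqrt(sum((hi - lo) ** 2 for lo, hi in zip(mins, maxs)))
    if diagonal == 0:
        raise ValueError("points must span a nonzero extent")
    return [tuple((p[i] - center[i]) / diagonal for i in range(3)) for p in points]

def rotate(point, rotation):
    """Apply a 3x3 rotation matrix (nested lists, row-major) to a 3D point."""
    return tuple(sum(rotation[i][j] * point[j] for j in range(3)) for i in range(3))

def camera_relative_colors(points, source_rotation):
    """Map 3D surface points to RGB-like triples in [0, 1].

    Assumption for illustration: `source_rotation` takes object-space
    coordinates into the source camera's frame, so the same surface point
    gets the same color in every target view, and the colors depend on the
    pose from which the source image was taken.
    """
    normalized = normalize_points(points)
    # Normalized points have norm <= 0.5, and rotation preserves norm,
    # so shifting by 0.5 keeps every channel inside [0, 1].
    return [tuple(c + 0.5 for c in rotate(p, source_rotation)) for p in normalized]

if __name__ == "__main__":
    pts = [(0, 0, 0), (2, 0, 0), (0, 2, 0), (0, 0, 2), (2, 2, 2)]
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    for color in camera_relative_colors(pts, identity):
        print(tuple(round(c, 3) for c in color))
```

Because a point's color depends only on its 3D position and the source pose, pixels showing the same surface point in different target views share a color. That is one way a pointmap can provide the cross-view correspondence the abstract refers to.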
