unPIC: A Geometric Multiview Prior for Image to 3D Synthesis

Rishabh Kabra, Drew A. Hudson, Sjoerd van Steenkiste, Joao Carreira, Niloy J. Mitra

TL;DR

unPIC addresses the underspecified problem of turning a single image into multiple plausible 3D views by factorizing the task into p(geometry | image) and p(appearance | geometry, image) within a diffusion-based two-stage framework. The geometry is encoded as CROCS, a camera-relative NOCS-inspired representation that enforces cross-view correspondence and predictable geometry across arbitrary source poses. Empirically, this geometry-grounded, hierarchical approach yields superior shape and multiview consistency compared to geometry-free baselines on ObjaverseXL and unseen real-world datasets, while remaining robust to out-of-distribution inputs. The work argues that explicit geometric supervision and decoupling of geometry from appearance substantially improve generalization and practical applicability for image-to-3D synthesis.
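One way to write the factorization described above, as a sketch rather than the authors' own notation: let $\mathbf{x}_0$ denote the source image, $\mathbf{g}_{1:N}$ the geometric feature maps for $N$ target views, and $\mathbf{x}_{1:N}$ the target-view images. These symbols are assumptions introduced here for illustration.

$$
p(\mathbf{x}_{1:N} \mid \mathbf{x}_0) \;=\; \int p(\mathbf{x}_{1:N} \mid \mathbf{g}_{1:N}, \mathbf{x}_0)\; p(\mathbf{g}_{1:N} \mid \mathbf{x}_0)\; d\mathbf{g}_{1:N}
$$

Under this reading, sampling proceeds in two steps: first draw $\mathbf{g}_{1:N} \sim p(\mathbf{g}_{1:N} \mid \mathbf{x}_0)$ from the geometry stage, then draw $\mathbf{x}_{1:N} \sim p(\mathbf{x}_{1:N} \mid \mathbf{g}_{1:N}, \mathbf{x}_0)$ from the appearance stage. Different geometry samples can yield different but equally valid sets of novel views.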

Abstract

We introduce a hierarchical probabilistic approach to go from a 2D image to multiview 3D: a diffusion "prior" predicts the unseen 3D geometry, which then conditions a diffusion "decoder" to generate novel views of the subject. We use a pointmap-based geometric representation to coordinate the generation of multiple target views simultaneously. We construct a predictable distribution of geometric features per target view to enable learnability across examples and generalization to arbitrary input images. Our modular, geometry-driven approach to novel-view synthesis (called "unPIC") beats competing baselines such as CAT3D, EscherNet, Free3D, and One-2-3-45 on held-out objects from ObjaverseXL, as well as unseen real-world objects from Google Scanned Objects, Amazon Berkeley Objects, and the Digital Twin Catalog.

Paper Structure

This paper contains 21 sections, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Top: A hierarchical approach to novel-view synthesis. A prior models multiview geometric features from a single image, and these features are jointly decoded into the target novel-view images. Our intermediate features, CROCS, establish point-to-point correspondence across views. Bottom: Samples from the prior and decoder. Our model exhibits transferable shape understanding despite having never seen a real-world pixel.
  • Figure 2: Schrödinger's cup: two sets of valid novel views, following different trajectories in representation space. The observed view does not reveal whether the cup has a handle or not.
  • Figure 3: Camera-Relative Object Coordinate Spaces. We show two data-points (Left and Right columns) obtained from one object. Top-left: The wireframe shows the RGB reference cube used to paint the object surface. The large camera denotes the source view, whereas the smaller cameras denote (3 of 7) novel views. Top-right: Suppose all camera locations are rotated by $\theta = 120$ degrees around the vertical axis while the object stays fixed. We then rotate the color reference cube by the same angle. This ensures that each camera faces the same side of the cube that it faced before the cameras were rotated. Bottom: Target CROCS images in clockwise order corresponding to the cameras above. Within a given data-point, any part of the object is consistently colored across target images. Across data-points, although a given part of the object may change color, its color is predictable from its location(s) in the target image(s). Each target view has a consistent color bias that is learnable across examples.
  • Figure 4: Qualitative comparison. The left column shows the source image, while the remaining images in each row are predicted novel views. One-2-3-45 produces multiview inconsistencies. CAT3D can squash the shapes in unseen views.
  • Figure 5: While all models produce plausible images, their shapes and poses can be off. We compare the predicted masks with the ground truth.
  • ...and 8 more figures
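
The color-cube rotation described in the Figure 3 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the vertical axis is taken to be y, object points are assumed to be normalized into a ball of radius 0.5 around the origin (so they stay inside the reference cube under any rotation), and the names `rotation_about_vertical` and `crocs_colors` are hypothetical.

```python
import numpy as np


def rotation_about_vertical(theta_deg):
    """Rotation matrix for an angle in degrees about the vertical (y) axis."""
    t = np.deg2rad(theta_deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])


def crocs_colors(points, source_azimuth_deg):
    """Color object points using a reference cube rotated with the source camera.

    points: (N, 3) array in the fixed object frame, assumed to lie within a
            ball of radius 0.5 centered at the origin.
    source_azimuth_deg: azimuth of the source camera about the vertical axis.
    Returns an (N, 3) array of RGB values in [0, 1].
    """
    R = rotation_about_vertical(source_azimuth_deg)
    # Express each point in the rotated cube's frame: R^T p per point,
    # which is `points @ R` for row vectors.
    local = points @ R
    # Shift from [-0.5, 0.5] to [0, 1] so coordinates read as RGB.
    return np.clip(local + 0.5, 0.0, 1.0)


if __name__ == "__main__":
    # The surface point facing the source camera should get the same color
    # no matter where the source camera sits around the object.
    for azimuth in (0.0, 120.0, 240.0):
        camera_dir = rotation_about_vertical(azimuth) @ np.array([0.0, 0.0, 1.0])
        facing_point = 0.5 * camera_dir
        print(azimuth, crocs_colors(facing_point[None, :], azimuth)[0])
```

In this sketch a point's color depends only on its 3D position and the source pose, never on the target camera, so the same object part is colored identically in every target view. Because the cube turns with the source camera, the point facing the source camera always maps to the same color (0.5, 0.5, 1.0) up to floating-point error, which mirrors the caption's claim that colors are predictable across data-points.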