Seeing a Rose in Five Thousand Ways

Yunzhi Zhang, Shangzhe Wu, Noah Snavely, Jiajun Wu

TL;DR

This work tackles the problem of learning object intrinsics (3D geometry, texture, and material properties) from a single image containing multiple instances of the same object type. It introduces a generative framework that represents intrinsics with neural fields (a 3D shape via a signed distance function $f_\theta$, albedo via $a_\psi$, and a shininess parameter $\alpha$) conditioned on a latent code, and renders them under environment extrinsics using a Phong-like lighting model and neural volume rendering. Training uses an adversarial setup on image crops, with pose-aware regularization and scale/translation augmentations to enforce 3D consistency and robustness in the limited-data regime. The model recovers object intrinsics from both in-the-wild and synthetic images, enabling novel-view synthesis, relighting, and generation, and it outperforms baselines such as GNeRF, Neural-PIL, and NeRD on multiple metrics, with notable gains in depth, albedo, and image realism. The approach offers a practical path to 3D-aware generation from minimal data, with applications in shape reconstruction, relighting, and controllable instance generation in real-world scenes.
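To make the "Phong-like lighting model" concrete, here is a minimal sketch of classic Phong shading at a single surface point. It is not the paper's implementation: the function name `phong_shade`, the coefficients `k_ambient`, `k_diffuse`, and `k_specular`, and the single directional light are assumptions for illustration. In the paper's setting, the albedo would come from $a_\psi$, the normal from the gradient of the SDF $f_\theta$, and the exponent from the shininess $\alpha$.

```python
import math


def _normalize(v):
    """Return v scaled to unit length (v must be non-zero)."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return tuple(c / norm for c in v)


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def phong_shade(albedo, normal, light_dir, view_dir, shininess,
                k_ambient=0.3, k_diffuse=0.7, k_specular=0.2):
    """Classic Phong shading for one surface point (illustrative only).

    albedo    -- RGB tuple in [0, 1]
    normal    -- surface normal (any non-zero length)
    light_dir -- direction from the surface towards the light
    view_dir  -- direction from the surface towards the camera
    shininess -- specular exponent; larger means tighter highlights
    The three k_* weights are hypothetical defaults, not values from the paper.
    """
    n = _normalize(normal)
    l = _normalize(light_dir)
    v = _normalize(view_dir)

    n_dot_l = _dot(n, l)
    diffuse = max(n_dot_l, 0.0)

    # Mirror reflection of the light direction about the normal.
    r = tuple(2.0 * n_dot_l * nc - lc for nc, lc in zip(n, l))
    # No highlight when the light is behind the surface.
    specular = max(_dot(r, v), 0.0) ** shininess if n_dot_l > 0.0 else 0.0

    # Albedo tints the ambient and diffuse terms; the highlight is white.
    return tuple(
        min(a * (k_ambient + k_diffuse * diffuse) + k_specular * specular, 1.0)
        for a in albedo
    )


if __name__ == "__main__":
    lit = phong_shade((0.5, 0.2, 0.1), (0, 0, 1), (0, 0, 1), (0, 0, 1), 10.0)
    print(lit)
```

Because albedo and shininess are separate inputs, relighting under this sketch amounts to calling the same function with a different `light_dir` while the intrinsics stay fixed.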

Abstract

What is a rose, visually? A rose comprises its intrinsics, including the distribution of geometry, texture, and material specific to its object category. With knowledge of these intrinsic properties, we may render roses of different sizes and shapes, in different poses, and under different lighting conditions. In this work, we build a generative model that learns to capture such object intrinsics from a single image, such as a photo of a bouquet. Such an image includes multiple instances of an object type. These instances all share the same intrinsics, but appear different due to a combination of variance within these intrinsics and differences in extrinsic factors, such as pose and illumination. Experiments show that our model successfully learns object intrinsics (distribution of geometry, texture, and material) for a wide range of objects, each from a single Internet image. Our method achieves superior results on multiple downstream tasks, including intrinsic image decomposition, shape and image generation, view synthesis, and relighting.

Paper Structure (35 sections, 9 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: From a single image, our model learns to infer object intrinsics: the distributions of the geometry, texture, and material of object instances within the image. The model can then generate new instances of the object type, and it allows us to view the object under different poses and lighting conditions. Project page at https://cs.stanford.edu/~yzzhang/projects/rose/.
  • Figure 2: Model overview. We propose a generative model that recovers the object intrinsics, including 3D shape and albedo, from a single input image with multiple similar object instances with instance masks. To synthesize an image, we sample from the learned object intrinsics (orange box) to obtain the shape and albedo for a specific instance, whose identity is controlled by an underlying latent space. Then, environmental extrinsics (blue box) are incorporated in the forward rendering procedure to obtain shading and appearance for the instance. Finally, the 3D representation for appearance is used to render images in 2D under arbitrary viewpoints. These synthesized images are then used, along with the real examples from the input image, in a generative adversarial framework to learn the object intrinsics.
  • Figure 3: Learning from images in the wild. Given a single 2D image containing dozens of similar object instances with masks, our model can recover a distribution of 3D shape and albedo from observations of the instances. We sample from the learnt distribution to obtain the albedo and normals for a specific instance, as shown in columns (b-c). The two columns in (d) show two different views of the same instance. At test time, our model can synthesize instances under novel views, shown in (e), and novel lighting conditions, shown in (f).
  • Figure 4: Results for test-time relighting. The 6 columns show renderings with different lighting conditions unseen during training.
  • Figure 5: Results of interpolation in the latent space. From left to right, each column of images corresponds to an instance with a specific latent code interpolated between two latent vectors. Instances from all columns are rendered with the same pose.
  • ...and 2 more figures
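
The latent-space interpolation described in the Figure 5 caption can be sketched as follows. The caption does not say which interpolation scheme is used, so this sketch assumes plain linear interpolation; the helper name `lerp_latents` and the list-of-floats representation of a latent code are hypothetical. Each returned code would be passed to the generator and rendered with one fixed pose to produce one column of the figure.

```python
def lerp_latents(z_start, z_end, num_steps):
    """Return num_steps latent codes evenly spaced from z_start to z_end.

    Linear interpolation is an assumption made for this sketch; the
    endpoints are included as the first and last codes.
    """
    if len(z_start) != len(z_end):
        raise ValueError("latent codes must have the same dimension")
    if num_steps < 2:
        raise ValueError("need at least two steps to include both endpoints")

    codes = []
    for i in range(num_steps):
        t = i / (num_steps - 1)  # runs from 0.0 to 1.0 inclusive
        codes.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return codes


if __name__ == "__main__":
    for code in lerp_latents([0.0, 2.0], [1.0, 0.0], 3):
        print(code)
```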