Under One Sun: Multi-Object Generative Perception of Materials and Illumination

Nobuo Yoshii, Xinran Nicole Han, Ryo Kawahara, Todd Zickler, Ko Nishino

Abstract

We introduce Multi-Object Generative Perception (MultiGP), a generative inverse rendering method for stochastic sampling of all radiometric constituents -- reflectance, texture, and illumination -- underlying object appearance from a single image. Our key idea to solve this inherently ambiguous radiometric disentanglement is to leverage the fact that while their texture and reflectance may differ, objects in the same scene are all lit by the same illumination. MultiGP exploits this consensus to produce samples of reflectance, texture, and illumination from a single image of known shapes based on four key technical contributions: a cascaded end-to-end architecture that combines image-space and angular-space disentanglement; Coordinated Guidance for diffusion convergence to a single consistent illumination estimate; Axial Attention applied to facilitate "cross-talk" between objects of different reflectance; and a Texture Extraction ControlNet to preserve high-frequency texture details while ensuring decoupling from estimated lighting. Experimental results demonstrate that MultiGP effectively leverages the complementary spatial and frequency characteristics of multiple object appearances to recover individual texture and reflectance as well as the common illumination.

Paper Structure (42 sections, 17 equations, 14 figures, 6 tables)

Figures (14)

  • Figure 1: MultiGP is an ambiguity-aware inverse rendering method that samples the reflectance and texture of each object alongside the global scene illumination from a single image. By leveraging the shared illumination across multiple objects, MultiGP employs a novel end-to-end architecture featuring tailored guidance, axial attention, and ControlNet structures to resolve the otherwise ambiguous radiometric disentanglement.
  • Figure 2: Overview of MultiGP. Given a scene with multiple objects, textures are first estimated via a diffusion model $q_\phi$ that accounts for global light transport. The resulting texture-free appearances are transformed into reflectance maps, from which a multi-object diffusion model $q_\theta$ estimates a shared illumination and respective reflectances. Finally, a ControlNet refines the textures for physical consistency with the estimated lighting and reflectance through a renderer.
  • Figure 3: (a) Multi-Object Coordinate Scheduling guides the appearance of diverse objects toward a single shared illumination estimate. At each denoising step, the process is governed by the estimated reflectances of multiple objects alongside a known mirror reflectance. Since the mirror reflectance uniquely corresponds to the illumination, these diverse inputs stochastically converge to a consistent, shared environment map. (b) Multi-Object Axial Attention shares reflectance and spatial information across different reflectance maps. This mechanism enables MultiGP to integrate complementary frequency information—otherwise attenuated by individual reflectances—and aggregate visible lighting directions at every denoising step.
  • Figure 4: Distribution of illumination samples from MultiGP and MultiGP (single). (a) Heterogeneous reflectances: MultiGP effectively integrates objects with different reflectances. (a-1) shows the complementary Spherical Harmonic (SH) frequency spectra of three input reflectance maps. (a-2) provides a 2D PCA visualization of 100 samples; orange dots represent the joint MultiGP distribution, while other colors represent individual MultiGP (single) estimates. The joint distribution (orange) captures the "ground truth" with the highest density. (b) Heterogeneous masks: Using the same reflectance with different masks demonstrates that MultiGP effectively integrates varied object shapes. Here too, the MultiGP distribution (orange) densely encompasses the "ground truth."
  • Figure 5: Illumination estimates on the Stanford-ORB dataset. For a fair comparison, we show scaled illumination results closest to the ground truth. For existing methods, we select the result from the object yielding the best logRMSE score. MultiGP faithfully captures the ground truth illumination structure with high fidelity.
  • ...and 9 more figures
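
The Figure 3(b) caption describes axial attention that shares information across the reflectance maps of different objects as well as across positions within each map. The following is a minimal NumPy sketch of generic axial attention over those two axes, not the authors' implementation. The tensor layout (objects, positions, channels), the function names, the residual connections, and the use of the raw features as queries, keys, and values (no learned projections) are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(x):
    """Scaled dot-product self-attention over the second-to-last axis.

    x has shape (..., L, C). Queries, keys, and values are x itself
    (an illustrative simplification: no learned projections).
    Returns the attended features (..., L, C) and weights (..., L, L).
    """
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ x, weights


def axial_attention(feats):
    """Hypothetical axial attention over a stack of per-object features.

    feats: array of shape (N, P, C) with N objects, P flattened
    positions per reflectance map, and C channels.
    """
    # Object axis: at each position, every object attends to all objects.
    per_position = np.swapaxes(feats, 0, 1)           # (P, N, C)
    mixed, w_obj = attend(per_position)               # (P, N, C), (P, N, N)
    feats = feats + np.swapaxes(mixed, 0, 1)          # residual, back to (N, P, C)

    # Position axis: within each object, every position attends to all positions.
    mixed, w_pos = attend(feats)                      # (N, P, C), (N, P, P)
    return feats + mixed, w_obj, w_pos


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(3, 16, 8))               # 3 objects, 16 positions, 8 channels
    out, w_obj, w_pos = axial_attention(feats)
    print(out.shape, w_obj.shape, w_pos.shape)
```

Attending along one axis at a time keeps the cost at roughly N²P + NP² score entries rather than the (NP)² needed for full attention over all objects and positions jointly, which is the usual motivation for the axial factorization.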