Multistable Shape from Shading Emerges from Patch Diffusion

Xinran Nicole Han, Todd Zickler, Ko Nishino

TL;DR

This work introduces a model that reconstructs a multimodal distribution of shapes from a single shading image, matching the human experience of multistable perception. It may inspire new architectures for stochastic 3D shape perception that are more efficient and better aligned with human experience.

Abstract

Models for inferring monocular shape of surfaces with diffuse reflection (shape from shading) ought to produce distributions of outputs, because there are fundamental mathematical ambiguities of both continuous (e.g., bas-relief) and discrete (e.g., convex/concave) types that are also experienced by humans. Yet the outputs of current models are limited to point estimates or tight distributions around single modes, which prevents them from capturing these effects. We introduce a model that reconstructs a multimodal distribution of shapes from a single shading image, which aligns with the human experience of multistable perception. We train a small denoising diffusion process to generate surface normal fields from $16\times 16$ patches of synthetic images of everyday 3D objects. We deploy this model patch-wise at multiple scales, with guidance from inter-patch shape consistency constraints. Despite its relatively small parameter count and predominantly bottom-up structure, we show that multistable shape explanations emerge from this model for ambiguous test images that humans experience as being multistable. At the same time, the model produces veridical shape estimates for object-like images that include distinctive occluding contours and appear less ambiguous. This may inspire new architectures for stochastic 3D shape perception that are more efficient and better aligned with human experience.
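The discrete convex/concave ambiguity mentioned in the abstract can be checked numerically. The sketch below is not the authors' code. It assumes a Lambertian surface with unit albedo, an orthographic view along the z-axis, and a single distant directional light. The helper names `normals_from_depth` and `shade` are illustrative, as are the bump shape and the light direction.

```python
import numpy as np

def normals_from_depth(z):
    """Unit surface normals of a height map z(x, y) via finite differences."""
    zy, zx = np.gradient(z)  # derivatives along rows (y) and columns (x)
    n = np.stack([-zx, -zy, np.ones_like(z)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def shade(normals, light):
    """Lambertian shading with unit albedo; negative values are clamped to 0."""
    return np.clip(normals @ light, 0.0, None)

# A smooth bump on a 16x16 grid and its depth-inverted (concave) counterpart.
ys, xs = np.mgrid[-1:1:16j, -1:1:16j]
z_convex = np.exp(-(xs**2 + ys**2) / 0.5)
z_concave = -z_convex

# Illustrative light direction; the concave shape is lit from the mirrored side.
light = np.array([0.3, 0.2, 1.0])
light /= np.linalg.norm(light)
light_flipped = light * np.array([-1.0, -1.0, 1.0])

img_convex = shade(normals_from_depth(z_convex), light)
img_concave = shade(normals_from_depth(z_concave), light_flipped)

# Two different shapes, one image: the shading alone cannot tell them apart.
print(np.allclose(img_convex, img_concave))
```

Negating the depth negates the x and y components of every normal, and negating the x and y components of the light cancels that change in the dot product, so the two renderings coincide. A point estimator has to commit to one of these shapes, which is the limitation the abstract describes.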

Paper Structure

This paper contains 34 sections, 10 equations, 14 figures, 3 tables, 2 algorithms.

Figures (14)

  • Figure 1: Many shapes (left) can explain the same image (middle) under different lighting, including flattened and tilted versions and convex/concave flips. The concave/convex flip in this example is also perceived by humans, often aided by rotating the image clockwise by 90 degrees. Previous methods for inferring either surface normals (SIRFS [barron2014shape], Derender3D [wimbauer2022rendering], Wonder3D [long2023wonder3d]) or depth (Marigold [ke2023repurposing], Depth Anything [yang2024depth]) produce a single shape estimate or a unimodal distribution. Ours produces a multimodal distribution that matches the perceived flip. (Image adapted from [kunsberg2021boundaries].)
  • Figure 2: Training patches are cropped from synthetic images of ordinary diffuse objects, and during training, a small diffusion model learns to denoise the normal field $x_0^u$ for patch $u$ from a random sample $x_T^u$ conditioned on the patch intensities $c^u$. During inference, the model is applied in parallel to non-overlapping patches, with guidance from inter-patch shape-consistency constraints to minimize the curvature smoothness loss $\mathcal{L}_S$ and integrability loss $\mathcal{L}_I$.
  • Figure 3: Top: Illustration of multiscale sampling across two scales in a fine-coarse-fine "V-cycle", with conditional images omitted for simplicity. In practice, our V-cycle covers more than two scales. Left: The $N \& R$ subroutine injects noise to an earlier timestep $0<t<T$ and then resumes guided sampling (Fig. 2) at that scale. Right: Optional intermediate guidance comes from lighting consistency (LCG), where each patch nominates a dominant light direction and then some patches flip in response to those nominations. Pseudocode is in the appendix.
  • Figure 4: Ablations, and comparison to human subjects using image and psychophysics data from [nartker2017distortions]. Left: Ablations demonstrate the importance of each component. Right: Depth cross-sections extracted from four (integrated) samples from the convex mode of our full model exhibit relief-like variations similar to those reported across human subjects. (The dashed line is the depth that was used to render the input image.)
  • Figure 5: Normals produced by our model for various synthetic test surfaces rendered with directional light sources. For depth maps, brighter is closer. "Reference" depicts the shapes (each with a convex/concave counterpart) that were used to render the input images. We find that our reconstructions are more accurate and diverse than those of other methods.
  • ...and 9 more figures
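
The Figure 2 caption names an integrability loss $\mathcal{L}_I$ but this excerpt does not define it. As a hedged illustration of the general idea only, the sketch below uses one common formulation: a normal field corresponds to a single continuous surface when its implied gradients $p = -n_x/n_z$ and $q = -n_y/n_z$ satisfy $\partial p/\partial y = \partial q/\partial x$. The function names, the mean-squared form of the residual, and the test fields are assumptions, not the paper's definitions.

```python
import numpy as np

def normals_from_gradients(p, q):
    """Unit normals for surface gradients p = dz/dx and q = dz/dy."""
    n = np.stack([-p, -q, np.ones_like(p)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def integrability_residual(normals, eps=1e-8):
    """Mean squared mismatch between dp/dy and dq/dx (assumed loss form)."""
    nz = np.clip(normals[..., 2], eps, None)  # guard against division by zero
    p = -normals[..., 0] / nz
    q = -normals[..., 1] / nz
    dp_dy = np.gradient(p, axis=0)  # rows index y
    dq_dx = np.gradient(q, axis=1)  # columns index x
    return float(np.mean((dp_dy - dq_dx) ** 2))

ys, xs = np.mgrid[-1:1:16j, -1:1:16j]

# Normals derived from a real height map are integrable by construction.
z = np.exp(-(xs**2 + ys**2) / 0.5)
zy, zx = np.gradient(z)
loss_integrable = integrability_residual(normals_from_gradients(zx, zy))

# A "twisted" gradient field (p = -y, q = x) has no underlying height map.
loss_twisted = integrability_residual(normals_from_gradients(-ys, xs))

print(loss_integrable, loss_twisted)
```

The residual is close to zero for the field derived from a height map and clearly positive for the twisted field, which is the kind of signal a guidance term could use to push neighboring patch predictions toward a consistent surface.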