Generative Perception of Shape and Material from Differential Motion

Xinran Nicole Han, Ko Nishino, Todd Zickler

TL;DR

The paper addresses the inherent ambiguity in inferring an object's shape and material from limited observations under unknown lighting. It introduces a pixel-space conditional diffusion framework, U-ViT3D-Mixer, that jointly samples time-varying surface normals and material maps from short videos of differential motion, representing multiple plausible explanations and using motion to reduce ambiguity. Key contributions include a unified backbone for joint shape-material inference, emergent multimodal predictions for static observations, improved accuracy when motion is observed, and competitive results on real-world data. This ambiguity-aware approach has implications for embodied AI and cognitive modeling, where visual reasoning must remain robust under uncertain lighting and material properties.

Abstract

Perceiving the shape and material of an object from a single image is inherently ambiguous, especially when lighting is unknown and unconstrained. Despite this, humans can often disentangle shape and material, and when they are uncertain, they often move their head slightly or rotate the object to help resolve the ambiguities. Inspired by this behavior, we introduce a novel conditional denoising-diffusion model that generates samples of shape-and-material maps from a short video of an object undergoing differential motions. Our parameter-efficient architecture allows training directly in pixel-space, and it generates many disentangled attributes of an object simultaneously. Trained on a modest number of synthetic object-motion videos with supervision on shape and material, the model exhibits compelling emergent behavior: For static observations, it produces diverse, multimodal predictions of plausible shape-and-material maps that capture the inherent ambiguities; and when objects move, the distributions converge to more accurate explanations. The model also produces high-quality shape-and-material estimates for less ambiguous, real-world objects. By moving beyond single-view to continuous motion observations, and by using generative perception to capture visual ambiguities, our work suggests ways to improve visual reasoning in physically-embodied systems.

Paper Structure

This paper contains 23 sections, 14 equations, 10 figures, and 7 tables.

Figures (10)

  • Figure 1: Left: We train our generative perception model to jointly infer shape and materials using synthetic three-frame videos of objects undergoing differential motions. Right: Our model (i) generalizes to captured real-world photos, (ii) leverages continuous observations to disentangle complex objects, and (iii) shows an emergent ability to provide multiple hypotheses for ambiguous static images, such as convex, concave, and planar 'postcard' explanations.
  • Figure 2: Our parameter-efficient denoising network, U-ViT3D-Mixer, takes in a channel-wise concatenation of conditional video frames and noisy shape-and-material frames. At high spatial resolutions, it uses efficient local 3D blocks (middle) with decoupled spatial, temporal, and channel-wise interactions. At lower spatial resolutions, it uses global transformer layers with full 3D attention.
  • Figure 3: Qualitative comparisons of estimated shape and albedo/texture from a static scene. For fair comparison of scale-invariant albedos, we visualize the scaled albedo from each model that is closest to the ground truth in the masked object region. Our model achieves shape predictions comparable to those of existing baselines that specialize in shape prediction, and it achieves better albedo estimates. Our model also produces greater spatial detail.
  • Figure 4: Our model exhibits multimodal shape perception on the ambiguous visual stimuli presented in Han et al. [han2024multistable], despite being trained only on everyday objects and without specific data augmentation.
  • Figure 5: Shape and material estimates from our full model. For each three-frame test video, we show one representative frame. The first four columns use randomly sampled materials, and the last four use original materials. Note that some specular objects are quite challenging; we demonstrate in the supplementary videos how motion aids disambiguation. On the right we show relighting examples using the estimated shape, albedo, and reflectance parameters under directional lighting.
  • ...and 5 more figures
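
The Figure 3 caption mentions visualizing, for each model, the scaled albedo closest to the ground truth inside the object mask. One common way to do this is a least-squares scale fit, sketched below. This is an illustration under stated assumptions, not the authors' code: it assumes a single scalar scale shared by all pixels, a squared-error notion of "closest", and albedo values stored as flat lists. The function name `align_albedo_scale` and the toy values are hypothetical.

```python
def align_albedo_scale(pred, gt, mask):
    """Fit one scalar s minimizing sum((s * pred - gt)^2) over masked pixels.

    pred, gt: flat lists of albedo values (assumed layout, one value per pixel).
    mask: flat list of booleans, True inside the object region.
    Returns (s, scaled_pred), where scaled_pred covers every pixel.
    """
    num = 0.0  # accumulates pred * gt over masked pixels
    den = 0.0  # accumulates pred * pred over masked pixels
    for p, g, m in zip(pred, gt, mask):
        if m:
            num += p * g
            den += p * p
    # Closed-form least-squares solution; fall back to 1.0 if pred is all zero.
    scale = num / den if den > 0.0 else 1.0
    return scale, [scale * p for p in pred]


# Hypothetical toy data: the prediction is half as bright as the ground truth
# inside the mask; the last pixel is background and is ignored by the fit.
pred = [0.2, 0.4, 0.1, 0.9]
gt = [0.4, 0.8, 0.2, 0.0]
mask = [True, True, True, False]

scale, scaled = align_albedo_scale(pred, gt, mask)
print(scale, scaled)
```

Fitting the scale only inside the mask keeps background pixels from biasing the comparison. A per-channel scale would be an equally reasonable variant; the caption does not say which one is used.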