Generative Perception of Shape and Material from Differential Motion
Xinran Nicole Han, Ko Nishino, Todd Zickler
TL;DR
The paper addresses the inherent ambiguity in inferring an object's shape and material from limited observations under unknown lighting. It introduces a pixel-space conditional diffusion framework, U-ViT3D-Mixer, that jointly samples time-varying surface normals and material maps from short videos of an object undergoing differential motion. The samples capture multiple plausible interpretations, and observed motion reduces the ambiguity. Key contributions include a unified backbone for joint shape-material inference, the emergence of multimodal predictions for static observations, improved accuracy when motion is observed, and competitive results on real-world data. This ambiguity-aware approach has implications for embodied AI and cognitive modeling, suggesting ways to make visual reasoning more robust when lighting and material properties are unknown.
Abstract
Perceiving the shape and material of an object from a single image is inherently ambiguous, especially when lighting is unknown and unconstrained. Despite this, humans can often disentangle shape and material, and when they are uncertain, they often move their head slightly or rotate the object to help resolve the ambiguities. Inspired by this behavior, we introduce a novel conditional denoising-diffusion model that generates samples of shape-and-material maps from a short video of an object undergoing differential motions. Our parameter-efficient architecture allows training directly in pixel space, and it generates many disentangled attributes of an object simultaneously. Trained on a modest number of synthetic object-motion videos with supervision on shape and material, the model exhibits compelling emergent behavior: for static observations, it produces diverse, multimodal predictions of plausible shape-and-material maps that capture the inherent ambiguities; and when objects move, the distributions converge to more accurate explanations. The model also produces high-quality shape-and-material estimates for less ambiguous, real-world objects. By moving beyond single-view to continuous motion observations, and by using generative perception to capture visual ambiguities, our work suggests ways to improve visual reasoning in physically embodied systems.
