SINGAPO: Single Image Controlled Generation of Articulated Parts in Objects
Jiayi Liu, Denys Iliash, Angel X. Chang, Manolis Savva, Ali Mahdavi-Amiri
TL;DR
The paper tackles creating high-fidelity 3D articulated household objects from a single resting-state image. It introduces a three-stage pipeline: infer a part connectivity graph from the image, generate abstract part attributes with a diffusion model conditioned on the image and graph, and retrieve meshes to assemble a coherent 3D articulated object. A diffusion-based denoiser with image cross-attention and graph-aware guidance is trained to produce plausible part configurations that respect the input while allowing variation to handle ambiguity; a GPT-4o module is used to derive the connectivity graph from the image, and mesh retrieval from a part library finalizes the asset. Evaluations on PartNet-Mobility and ACD show strong reconstruction quality, robust generalization, and favorable user-study results compared to state-of-the-art baselines, highlighting the method's potential for scalable, editable articulated-object creation from single images.
Abstract
We address the challenge of creating 3D assets for household articulated objects from a single image. Prior work on articulated object creation either requires multi-view multi-state input, or only allows coarse control over the generation process. These limitations hinder the scalability and practicality of articulated object modeling. In this work, we propose a method to generate articulated objects from a single image. Observing the object in its resting state from an arbitrary view, our method generates an articulated object that is visually consistent with the input image. To capture the ambiguity in part shape and motion posed by a single view of the object, we design a diffusion model that learns the plausible variations of objects in terms of geometry and kinematics. To tackle the complexity of generating structured data with attributes in multiple domains, we design a pipeline that produces articulated objects from high-level structure to geometric details in a coarse-to-fine manner, where we use a part connectivity graph and part abstraction as proxies. Our experiments show that our method outperforms the state-of-the-art in articulated object creation by a large margin in terms of the realism of the generated objects, their resemblance to the input image, and reconstruction quality.
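The coarse-to-fine flow described above (part connectivity graph, then part abstraction, then mesh retrieval) can be illustrated with a small Python sketch. It is not the authors' implementation. The graph and the part attributes are written by hand here in place of the GPT-4o and diffusion stages, and the retrieval rule (nearest bounding-box size among meshes with the same label) is an assumed stand-in for the paper's retrieval step. All class, field, and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class PartAbstraction:
    """Hypothetical abstract attributes for one part (illustrative fields only)."""
    label: str        # semantic label, e.g. "door"
    bbox_size: Vec3   # bounding-box extents
    joint_type: str   # e.g. "fixed", "revolute", "prismatic"


@dataclass
class LibraryMesh:
    """Hypothetical entry in a part-mesh library."""
    mesh_id: str
    label: str
    bbox_size: Vec3


def children_from_parents(parents: List[Optional[int]]) -> Dict[int, List[int]]:
    """Turn a parent-index list (root has parent None) into adjacency lists."""
    children: Dict[int, List[int]] = {i: [] for i in range(len(parents))}
    for child, parent in enumerate(parents):
        if parent is not None:
            children[parent].append(child)
    return children


def retrieve_mesh(part: PartAbstraction, library: List[LibraryMesh]) -> LibraryMesh:
    """Assumed retrieval rule: same label, closest bounding-box size."""
    candidates = [m for m in library if m.label == part.label]
    if not candidates:
        raise LookupError(f"no library mesh with label {part.label!r}")

    def squared_distance(mesh: LibraryMesh) -> float:
        return sum((a - b) ** 2 for a, b in zip(mesh.bbox_size, part.bbox_size))

    return min(candidates, key=squared_distance)


def assemble(parents: List[Optional[int]],
             parts: List[PartAbstraction],
             library: List[LibraryMesh]) -> List[dict]:
    """Attach a retrieved mesh to every node of the connectivity graph."""
    if len(parents) != len(parts):
        raise ValueError("one parent entry is required per part")
    return [
        {
            "part": index,
            "parent": parents[index],
            "joint_type": part.joint_type,
            "mesh_id": retrieve_mesh(part, library).mesh_id,
        }
        for index, part in enumerate(parts)
    ]


# Hand-written stand-ins for stage 1 (graph) and stage 2 (part attributes).
example_parents: List[Optional[int]] = [None, 0, 0]
example_parts = [
    PartAbstraction("base", (0.6, 1.2, 0.5), "fixed"),
    PartAbstraction("door", (0.5, 1.0, 0.05), "revolute"),
    PartAbstraction("drawer", (0.5, 0.3, 0.4), "prismatic"),
]
example_library = [
    LibraryMesh("base_a", "base", (0.6, 1.0, 0.5)),
    LibraryMesh("door_a", "door", (0.3, 0.6, 0.02)),
    LibraryMesh("door_b", "door", (0.5, 0.9, 0.05)),
    LibraryMesh("drawer_a", "drawer", (0.5, 0.25, 0.4)),
]
example_object = assemble(example_parents, example_parts, example_library)
```

Keeping the graph, the abstract attributes, and the meshes as separate data makes each stage replaceable: a different graph predictor or part library could be swapped in without touching the other two.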
