SINGAPO: Single Image Controlled Generation of Articulated Parts in Objects

Jiayi Liu, Denys Iliash, Angel X. Chang, Manolis Savva, Ali Mahdavi-Amiri

TL;DR

The paper tackles creating high-fidelity 3D articulated household objects from a single resting-state image. It introduces a three-stage pipeline: infer a part connectivity graph from the image, generate abstract part attributes with a diffusion model conditioned on the image and graph, and retrieve meshes to assemble a coherent 3D articulated object. A diffusion-based denoiser with image cross-attention and graph-aware guidance is trained to produce plausible part configurations that respect the input while allowing variation to handle ambiguity; a GPT-4o module is used to derive the connectivity graph from the image, and mesh retrieval from a part library finalizes the asset. Evaluations on PartNet-Mobility and ACD show strong reconstruction quality, robust generalization, and favorable user-study results compared to state-of-the-art baselines, highlighting the method's potential for scalable, editable articulated-object creation from single images.
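The data passed between the three stages can be pictured with a small sketch. This is only an illustration, not the authors' implementation: the class names `PartGraph` and `PartAbstraction`, the attribute fields, and the nearest-bounding-box retrieval rule in `retrieve_mesh` are all assumptions introduced here. The graph and the part attributes are hard-coded where the real pipeline would call GPT-4o and the diffusion model.

```python
from dataclasses import dataclass


@dataclass
class PartGraph:
    """Stage 1 output (hypothetical form): parent index per part, -1 marks the root."""
    parents: list

    def edges(self):
        # Each non-root part contributes one (parent, child) edge.
        return [(p, c) for c, p in enumerate(self.parents) if p >= 0]


@dataclass
class PartAbstraction:
    """Stage 2 output (hypothetical fields): a coarse box plus a joint label."""
    bbox_size: tuple   # (width, height, depth), illustrative units
    joint_type: str    # illustrative labels such as "fixed" or "revolute"


def retrieve_mesh(part, library):
    """Stage 3 stand-in: choose the library entry with the closest box size.

    Assumption: the actual retrieval criterion is not specified in this text;
    squared distance between box sizes is used purely for illustration.
    """
    def size_gap(entry):
        return sum((a - b) ** 2 for a, b in zip(entry["size"], part.bbox_size))
    return min(library, key=size_gap)["name"]


# Toy cabinet: part 0 is the body, parts 1 and 2 attach to it.
graph = PartGraph(parents=[-1, 0, 0])
parts = [
    PartAbstraction(bbox_size=(1.0, 2.0, 0.5), joint_type="fixed"),
    PartAbstraction(bbox_size=(0.9, 0.9, 0.05), joint_type="revolute"),
]
library = [
    {"name": "door_panel", "size": (1.0, 1.0, 0.05)},
    {"name": "cabinet_body", "size": (1.0, 2.0, 0.5)},
]
chosen = [retrieve_mesh(part, library) for part in parts]
```

Keeping the graph, the abstract attributes, and the retrieved meshes as separate objects mirrors the coarse-to-fine idea: each stage can be inspected or edited before the next one runs.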

Abstract

We address the challenge of creating 3D assets for household articulated objects from a single image. Prior work on articulated object creation either requires multi-view multi-state input, or only allows coarse control over the generation process. These limitations hinder the scalability and practicality of articulated object modeling. In this work, we propose a method to generate articulated objects from a single image. Observing the object in resting state from an arbitrary view, our method generates an articulated object that is visually consistent with the input image. To capture the ambiguity in part shape and motion posed by a single view of the object, we design a diffusion model that learns the plausible variations of objects in terms of geometry and kinematics. To tackle the complexity of generating structured data with attributes in multiple domains, we design a pipeline that produces articulated objects from high-level structure to geometric details in a coarse-to-fine manner, where we use a part connectivity graph and part abstraction as proxies. Our experiments show that our method outperforms the state of the art in articulated object creation by a large margin in terms of the generated object realism, resemblance to the input image, and reconstruction quality.

Paper Structure

This paper contains 21 sections, 1 equation, 12 figures, 9 tables.

Figures (12)

  • Figure 1: Our work proposes to generate 3D articulated objects from a single image observing the object in the resting state from a random view. Left: We design a pipeline to synthesize articulated assets progressively from coarse to fine details in a modular way. Right: Our method generates objects with varying part geometry and motion to account for the ambiguity in the input image.
  • Figure 2: Our method takes an object image as input and generates attributes of articulated parts, which are used to assemble the object via part mesh retrieval. We design a diffusion-based model for part generation, which is guided by a part connectivity graph and DINOv2 patch features of the input image. Our denoising network is built from layers of attention blocks. The graph constraint is injected into the graph relation module by converting the graph to an adjacency matrix that serves as the attention mask. The image features act as the keys and values in the cross-attention to condition the part arrangement.
  • Figure 3: Attention maps for two example parts visualized at the second-to-last layer.
  • Figure 4: Qualitative comparison on the PartNet-Mobility test set. For each set of results, the first column shows the predicted part connectivity graph (incorrect graphs are marked with red boxes), the second column shows the part arrangement and the joint of each part with the object in the resting state, where each part's color corresponds to its node in the graph, and the third column shows the final articulated object asset. Our method outperforms the baselines with better graph prediction, part arrangements that are more consistent with the input image, and more plausible part articulations.
  • Figure 5: Qualitative comparison on the ACD dataset in a zero-shot setting. The first four rows show that our method can generate more geometrically accurate objects with plausible motions compared to the baselines. The red boxes denote incorrect part connectivity graphs relative to the ground truth. The first row shows that even when our predicted graph differs from the ground truth due to ambiguity in the image, our method can still generate a realistic and reasonable object. The last two rows show failure cases for two challenging input images. When the texture is complex or the part arrangement is cluttered, our method may not accurately recover some details (e.g., two knobs on a drawer merged into one handle in the second-to-last example; doors and drawers misplaced in the last example).
  • ...and 7 more figures
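
The Figure 2 caption describes turning the part connectivity graph into an adjacency matrix that masks attention between parts. A minimal pure-Python sketch of that idea follows. It is not the authors' network: the function names, the toy two-dimensional part embeddings, the choice to treat edges as undirected, and the inclusion of self-loops are all assumptions made for illustration.

```python
import math


def adjacency_mask(num_parts, edges, self_loops=True):
    """Boolean mask: entry [i][j] is True when part i may attend to part j."""
    mask = [[False] * num_parts for _ in range(num_parts)]
    for a, b in edges:
        # Assumption: connectivity is treated as symmetric.
        mask[a][b] = True
        mask[b][a] = True
    if self_loops:
        # Assumption: each part may also attend to itself.
        for i in range(num_parts):
            mask[i][i] = True
    return mask


def masked_softmax(scores, mask):
    """Row-wise softmax that assigns zero weight to masked-out positions."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        allowed = [s for s, m in zip(row_scores, row_mask) if m]
        if not allowed:
            weights.append([0.0] * len(row_scores))
            continue
        peak = max(allowed)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if m else 0.0
                for s, m in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def masked_attention(queries, keys, values, mask):
    """Scaled dot-product attention restricted by a boolean mask."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    scores = [[scale * sum(q_i * k_i for q_i, k_i in zip(q, k)) for k in keys]
              for q in queries]
    weights = masked_softmax(scores, mask)
    dim = len(values[0])
    outputs = [[sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
               for row in weights]
    return outputs, weights


# Toy object: part 0 is the base; parts 1 and 2 both connect to it.
edges = [(0, 1), (0, 2)]
mask = adjacency_mask(3, edges)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up part embeddings
outputs, weights = masked_attention(tokens, tokens, tokens, mask)
```

In this toy example parts 1 and 2 are not connected, so each gives the other zero attention weight, while the base attends to every part. The same `masked_attention` routine with an all-True mask, part tokens as queries, and image patch features as keys and values would give an unmasked cross-attention of the kind the caption mentions, again only as a sketch.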