DragAPart: Learning a Part-Level Motion Prior for Articulated Objects

Ruining Li, Chuanxia Zheng, Christian Rupprecht, Andrea Vedaldi

TL;DR

DragAPart presents a motion prior for articulated objects by fine-tuning a pre-trained image generator on a synthetic Drag-a-Move dataset with a novel multi-resolution drag encoding. It enables part-level deformations in response to drags and generalizes to real images and unseen categories through domain randomization. The approach outperforms prior drag-conditioned methods in both quantitative metrics and qualitative assessments, and enables downstream tasks such as moving-part segmentation and motion analysis. The Drag-a-Move dataset provides ground-truth drags and articulations to support data-driven learning of fine-grained dynamics.

Abstract

We introduce DragAPart, a method that, given an image and a set of drags as input, generates a new image of the same object that responds to the action of the drags. Unlike prior works that focused on repositioning objects, DragAPart predicts part-level interactions, such as opening and closing a drawer. We study this problem as a proxy for learning a generalist motion model, not restricted to a specific kinematic structure or object category. We start from a pre-trained image generator and fine-tune it on a new synthetic dataset, Drag-a-Move, which we introduce. Combined with a new encoding for the drags and dataset randomization, the model generalizes well to real images and different categories. Compared to prior motion-controlled generators, we demonstrate much better part-level motion understanding.

Paper Structure (50 sections, 5 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: Examples of our DragAPart. Each drag in DragAPart represents a part-level interaction, resulting in a physically plausible deformation of the object shape. DragAPart is trained on a new synthetic dataset, Drag-a-Move, for this task, and generalizes well to real data and even unseen categories. The trained model can also be used for segmenting movable parts and analyzing motion prompted by a drag.
  • Figure 2: The Overall Pipeline of DragAPart. (a) Our model takes as input a single RGB image $y$ and one or more drags $\mathcal{D}$, and generates a second image $x$ that reflects the effect of the drags (\ref{sec:preliminaries}). (b) We propose a novel flow encoder (\ref{sec:arc_drag_conditions}), which enables us to inject the motion control into the latent diffusion model at different resolutions more efficiently. The resolutions $4$ and $2$ shown are for illustrative purposes only: our model generates $256\times 256$ images, and the first two latent blocks have resolutions $32$ and $16$. (c) At inference time, our model generalizes to real data, synthesizing physically plausible part-level dynamics.
  • Figure 3: Animations from the Drag-a-Move dataset. We visualize two objects with diverse articulation states: the left is rendered with the original texture and the right with each part in a single random color.
  • Figure 4: Qualitative Comparisons on real images from the ABO \cite{collins2022abo} dataset with manually defined drags (a-c), the Human3.6M \cite{h36m_pami} dataset (d), and a rendered image from our Drag-a-Move test split (e). The images generated by our model are more realistic and capture nuanced part-level dynamics.
  • Figure 5: Qualitative Ablations. The visual results are consistent with the numerical ones in \ref{tab:compare-architecture} and \ref{tab:compare-data} and validate our design choices.
  • ...and 5 more figures
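
The Figure 2 caption describes injecting the drag condition into the latent diffusion model at several resolutions (32 and 16 for a 256×256 image). The following is a minimal sketch of what rasterising drags at multiple resolutions could look like; it is not the paper's flow encoder. The function names `encode_drags` and `encode_multires`, the four-channels-per-drag layout, the normalisation to [0, 1], and the fill value of -1 are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

# A drag is ((u_src, v_src), (u_dst, v_dst)) in pixel coordinates (assumed convention).
Drag = Tuple[Tuple[float, float], Tuple[float, float]]
Grid = List[List[List[float]]]


def encode_drags(drags: Sequence[Drag], image_size: int, resolution: int,
                 fill: float = -1.0) -> Grid:
    """Rasterise drags into a resolution x resolution x (4 * len(drags)) grid.

    Hypothetical layout: each drag owns 4 channels. The cell containing the
    drag's source point stores the source and target positions normalised
    to [0, 1]; every other cell keeps the fill value.
    """
    scale = resolution / image_size
    channels = 4 * len(drags)
    grid = [[[fill] * channels for _ in range(resolution)]
            for _ in range(resolution)]
    for k, ((us, vs), (ut, vt)) in enumerate(drags):
        # Locate the grid cell that contains the source point at this resolution.
        col = min(int(us * scale), resolution - 1)
        row = min(int(vs * scale), resolution - 1)
        grid[row][col][4 * k:4 * k + 4] = [
            us / image_size, vs / image_size,
            ut / image_size, vt / image_size,
        ]
    return grid


def encode_multires(drags: Sequence[Drag], image_size: int = 256,
                    resolutions: Sequence[int] = (32, 16)) -> Dict[int, Grid]:
    """Build one sparse drag map per latent resolution."""
    return {r: encode_drags(drags, image_size, r) for r in resolutions}


# One drag pulling a point 32 pixels to the right in a 256 x 256 image.
example = encode_multires([((64.0, 128.0), (96.0, 128.0))])
```

In this sketch each resolution gets its own sparse map, so a conditioning signal of matching spatial size is available for each latent block. How the maps are actually fused with the network's features is left out, since the caption does not specify it.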