DragAnything: Motion Control for Anything using Entity Representation

Weijia Wu, Zhuang Li, Yuchao Gu, Rui Zhao, Yefei He, David Junhao Zhang, Mike Zheng Shou, Yan Li, Tingting Gao, Di Zhang

TL;DR

DragAnything tackles trajectory-based motion control in controllable video generation by introducing an open-domain, entity-level representation sourced from diffusion latent features. It combines an entity embedding with a 2D Gaussian map and injects these through a diffusion-based backbone via a ControlNet-like encoder to achieve precise entity- and background-aware motion control. The approach yields state-of-the-art results on FVD, FID, and human motion-control voting, showing substantial improvements over DragNUWA and enabling simultaneous control of multiple objects. Limitations include 2D motion constraints and reliance on the foundation model, with future work aimed at 3D trajectories and stronger diffusion models for larger motions.

Abstract

We introduce DragAnything, which utilizes an entity representation to achieve motion control for any object in controllable video generation. Compared to existing motion control methods, DragAnything offers several advantages. Firstly, trajectory-based control is more user-friendly for interaction when acquiring other guidance signals (e.g., masks, depth maps) is labor-intensive. Users only need to draw a line (trajectory) during interaction. Secondly, our entity representation serves as an open-domain embedding capable of representing any object, enabling the control of motion for diverse entities, including the background. Lastly, our entity representation allows simultaneous and distinct motion control for multiple objects. Extensive experiments demonstrate that DragAnything achieves state-of-the-art performance on FVD, FID, and the user study, particularly in terms of object motion control, where our method surpasses previous methods (e.g., DragNUWA) by 26% in human voting.

Paper Structure (28 sections, 3 equations, 11 figures, 3 tables)


Figures (11)

  • Figure 1: Comparison with Previous Works. (a) Previous works (MotionCtrl [wang2023motionctrl], DragNUWA [yin2023dragnuwa]) achieved motion control by dragging pixel points or pixel regions. (b) DragAnything enables more precise entity-level motion control by manipulating the corresponding entity representation.
  • Figure 2: Comparison of Different Representation Modeling. (a) Point representation: using a coordinate point $(x,y)$ to represent an entity. (b) Trajectory map: using a trajectory vector map to represent the trajectory of the entity. (c) 2D Gaussian: using a 2D Gaussian map to represent an entity. (d) Box representation: using a bounding box to represent an entity. (e) Entity representation: extracting the latent diffusion feature of the entity to characterize it.
  • Figure 3: Toy experiment for the motivation of Entity Representation. Existing methods (DragNUWA [yin2023dragnuwa] and MotionCtrl [wang2023motionctrl]) directly drag pixels, which cannot precisely control object targets, whereas our method employs entity representation to achieve precise control.
  • Figure 4: DragAnything Framework. The architecture includes two parts: 1) Entity Semantic Representation Extraction. Latent features from the Diffusion Model are extracted based on entity mask indices to serve as corresponding entity representations. 2) Main Framework for DragAnything. Utilizing the corresponding entity representations and 2D Gaussian representations to control the motion of entities.
  • Figure 5: Illustration of ground truth generation procedure. During the training process, we generate ground truth labels from video segmentation datasets that have entity-level annotations.
  • ...and 6 more figures
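
The captions for Figures 2 and 4 name two ingredients: a 2D Gaussian map that marks where an entity is, and an entity representation taken from diffusion latent features at the entity's mask indices. The sketch below illustrates both ideas with NumPy only. It is not the authors' implementation: the function names, the `sigma` parameter, the `(x, y)` point convention, and the use of mean pooling over the masked features are all assumptions made for illustration.

```python
import numpy as np


def gaussian_map(height, width, center, sigma):
    """Return a (height, width) heatmap that equals 1 at `center` = (x, y).

    Hypothetical helper: `sigma` sets the spread and is an assumed parameter.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    cx, cy = center
    sq_dist = (xs - cx) ** 2 + (ys - cy) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def trajectory_gaussian_maps(trajectory, height, width, sigma):
    """Stack one Gaussian map per trajectory point: shape (T, height, width)."""
    return np.stack([gaussian_map(height, width, p, sigma) for p in trajectory])


def entity_embedding(features, mask):
    """Mean-pool the feature vectors selected by a boolean entity mask.

    features: (H, W, C) array standing in for diffusion latent features.
    mask:     (H, W) boolean array marking the entity's pixels.
    Returns a (C,) vector. Mean pooling is an assumption of this sketch.
    """
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("entity mask is empty")
    return features[mask].mean(axis=0)


if __name__ == "__main__":
    # Random features stand in for latents from a diffusion model.
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(32, 32, 8))

    # A square mask stands in for an entity segmentation mask.
    mask = np.zeros((32, 32), dtype=bool)
    mask[10:20, 12:22] = True

    emb = entity_embedding(feats, mask)
    maps = trajectory_gaussian_maps([(12, 10), (16, 12), (20, 14)], 32, 32, sigma=3.0)
    print(emb.shape, maps.shape)
```

In this toy version the trajectory drawn by the user becomes one heatmap per frame, and the entity is summarized by a single vector. How these two signals are fed to the ControlNet-like encoder is described in the paper itself and is not reproduced here.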