DragAnything: Motion Control for Anything using Entity Representation
Weijia Wu, Zhuang Li, Yuchao Gu, Rui Zhao, Yefei He, David Junhao Zhang, Mike Zheng Shou, Yan Li, Tingting Gao, Di Zhang
TL;DR
DragAnything tackles trajectory-based motion control in controllable video generation by introducing an open-domain, entity-level representation sourced from diffusion latent features. It combines an entity embedding with a 2D Gaussian map and injects both into a diffusion-based backbone via a ControlNet-like encoder to achieve precise entity- and background-aware motion control. The approach yields state-of-the-art results on FVD, FID, and a user study, surpassing DragNUWA by 26% in human voting on object motion control, and it enables simultaneous control of multiple objects. Limitations include 2D motion constraints and reliance on the foundation model, with future work aimed at 3D trajectories and stronger diffusion models for larger motions.
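To make the "2D Gaussian map" concrete, the following is a minimal sketch of how a drawn trajectory could be turned into one heatmap per frame. It is an illustration only, not the authors' implementation: the function names, the `sigma` value, the map size, and the choice of one map per trajectory point are all assumptions that the summary above does not specify.

```python
import math

def gaussian_map(height, width, center, sigma=3.0):
    """Return a height x width heatmap (list of rows) peaking at center=(x, y).

    Hypothetical helper: sigma and the unnormalised form are assumptions.
    """
    cx, cy = center
    return [
        [
            math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]

def trajectory_to_maps(height, width, trajectory, sigma=3.0):
    """Build one Gaussian map per trajectory point (assumed: one point per frame)."""
    return [gaussian_map(height, width, point, sigma) for point in trajectory]

# A hypothetical drag: an entity centre moving right over four frames.
trajectory = [(8, 16), (12, 16), (16, 16), (20, 16)]
maps = trajectory_to_maps(32, 32, trajectory)
```

Each map takes its maximum value at the trajectory point for that frame and decays smoothly with distance, which gives a soft spatial cue about where the entity should be. How such maps are combined with the entity embedding and fed to the ControlNet-like encoder is not covered by this sketch.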
Abstract
We introduce DragAnything, which utilizes an entity representation to achieve motion control for any object in controllable video generation. Compared to existing motion control methods, DragAnything offers several advantages. Firstly, trajectory-based control is more user-friendly for interaction when acquiring other guidance signals (e.g., masks, depth maps) is labor-intensive. Users only need to draw a line (trajectory) during interaction. Secondly, our entity representation serves as an open-domain embedding capable of representing any object, enabling motion control for diverse entities, including the background. Lastly, our entity representation allows simultaneous and distinct motion control for multiple objects. Extensive experiments demonstrate that DragAnything achieves state-of-the-art performance on FVD, FID, and a user study, particularly in terms of object motion control, where our method surpasses previous methods (e.g., DragNUWA) by 26% in human voting.
