DragEntity: Trajectory Guided Video Generation using Entity and Positional Relationships

Zhang Wan, Sheng Tang, Jiawei Wei, Ruize Zhang, Juan Cao

TL;DR

DragEntity is a video generation model that uses entity representations to control the motion of multiple objects. It is more user-friendly than pixel-level methods because users drag entities within the image rather than individual pixels.

Abstract

In recent years, diffusion models have achieved tremendous success in video generation, and controllable video generation has received significant attention. However, existing control methods still face two limitations. First, control conditions (such as depth maps and 3D meshes) are difficult for ordinary users to obtain directly. Second, it is challenging to drive multiple objects through complex motions with multiple trajectories simultaneously. In this paper, we introduce DragEntity, a video generation model that uses entity representations to control the motion of multiple objects. Compared to previous methods, DragEntity offers two main advantages: 1) Our method is more user-friendly because it allows users to drag entities within the image rather than individual pixels. 2) We use entity representations to represent any object in the image, so that multiple objects can maintain their relative spatial relationships. As a result, multiple trajectories of differing complexity can control multiple objects in the image simultaneously. Our experiments validate the effectiveness of DragEntity and demonstrate its strong performance in fine-grained control of video generation.

Paper Structure

This paper contains 12 sections, 7 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Some examples. Input an image and a trajectory, and output a video. We select and display the 1st, 5th, 15th, and 20th frames for demonstration.
  • Figure 2: Comparison of different representation modeling methods: (a) Point Representation: Represents an entity using coordinate points (x, y). (b) Trajectory Graph: Represents the trajectory of an entity using a trajectory vector graph. (c) 2D Gaussian Distribution: Represents an entity using a two-dimensional Gaussian mapping. (d) Box Representation: Represents an entity using a bounding box. (e) Entity Representation: Represents an entity using latent features that include spatial relationships between objects.
  • Figure 3: Experiments on the motivation for entity representation. Existing methods (DragNUWA and MotionCtrl) involve directly dragging pixels, which cannot precisely control the target, leading to camera motion or target structure distortion. In contrast, our method utilizes entity representation and models spatial relative positions to achieve accurate control.
  • Figure 4: Model Framework. This image consists of two parts: (a) Entity Semantic Representation Extraction. Latent features are extracted based on entity mask indices, integrating the relative spatial relationships between objects to form their respective entity representations. (b) Main Framework. Based on the SVD model (Blattmann et al., 2023), it utilizes the corresponding entity representations to precisely control motion.
  • Figure 5: Image Position-Aware Relationship Module. The entity representation includes more information about the relative spatial relationships between objects.
  • ...and 3 more figures
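
To make the representations compared in Figure 2 and the mask-indexed extraction in Figure 4(a) more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names (gaussian_map, box_from_mask, gather_masked_features), the sigma parameter, and the toy nested-list feature map are all assumptions made for illustration. The paper's entity representation additionally encodes relative spatial relationships between objects, which this sketch does not attempt to model.

```python
import math


def gaussian_map(height, width, center, sigma=1.0):
    """Figure 2(c)-style map: a 2D Gaussian heatmap peaked at center=(x, y).

    Hypothetical helper; sigma is an illustrative choice, not a paper value.
    """
    cx, cy = center
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def box_from_mask(mask):
    """Figure 2(d)-style box: tight (x_min, y_min, x_max, y_max) around a mask."""
    coords = [(x, y)
              for y, row in enumerate(mask)
              for x, value in enumerate(row) if value]
    if not coords:
        raise ValueError("mask selects no pixels")
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


def gather_masked_features(features, mask):
    """Collect the feature vectors at positions where the entity mask is set.

    Assumed stand-in for "latent features extracted based on entity mask
    indices"; features[y][x] is a list of floats, rows are scanned top to
    bottom and left to right.
    """
    return [features[y][x]
            for y, row in enumerate(mask)
            for x, value in enumerate(row) if value]


# Toy 4x4 example: one entity occupying the central 2x2 block.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
# Toy "latent" map whose feature at (x, y) is simply [x, y].
features = [[[float(x), float(y)] for x in range(4)] for y in range(4)]

heatmap = gaussian_map(4, 4, center=(1, 2))
box = box_from_mask(mask)
entity_features = gather_masked_features(features, mask)
```

In this toy setup a point or Gaussian map only marks where a drag starts, and a box only bounds the object, whereas the gathered features carry the content at every masked position. This mirrors the contrast the Figure 2 caption draws between point, Gaussian, box, and entity representations.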