C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation

Yuhao Li, Mirana Claire Angel, Salman Khan, Yu Zhu, Jinqiu Sun, Yanning Zhang, Fahad Shahbaz Khan

TL;DR

C-Drag addresses the challenge of generating controllable videos with realistic multi-object interactions by integrating a perception-first pipeline with Chain-of-Thought based motion reasoning to produce interaction-aware trajectories from a single image and a drag trajectory, which are then fed to a pre-trained trajectory diffusion generator. The method is training-free and relies on a Vision-Language Model (VLM) and object-perception module to identify objects, followed by a five-stage CoT reasoning process to infer how objects interact (e.g., collisions, gravity, mirrors) and predict trajectories for all moving objects. A new VOI benchmark (72 videos across three interaction types) with ground-truth trajectories is introduced, along with the MOC metric to quantify motion consistency across all objects; experiments show C-Drag outperforms prior trajectory-based methods in FVD, FID, and MOC on VOI. Overall, C-Drag advances controllable video generation by embedding structured, stage-wise reasoning about object interactions into the trajectory-based synthesis pipeline, enabling more realistic multi-object dynamics and providing a practical benchmark for evaluation.

Abstract

Trajectory-based motion control has emerged as an intuitive and efficient approach for controllable video generation. However, existing trajectory-based approaches usually generate only the motion trajectory of the controlled object and ignore the dynamic interactions between the controlled object and its surroundings. To address this limitation, we propose a Chain-of-Thought-based motion controller for controllable video generation, named C-Drag. Instead of directly generating the motion of some objects, C-Drag first performs object perception and then reasons about the dynamic interactions between different objects according to the given motion control of the objects. Specifically, our method includes an object perception module and a Chain-of-Thought-based motion reasoning module. The object perception module employs visual language models to capture the position and category information of the objects within the image. The Chain-of-Thought-based motion reasoning module takes this information as input and conducts a stage-wise reasoning process to generate motion trajectories for each of the affected objects, which are subsequently fed to the diffusion model for video synthesis. Furthermore, we introduce a new video object interaction (VOI) dataset to evaluate the generation quality of motion-controlled video generation methods. Our VOI dataset contains three typical types of interactions and provides the motion trajectories of objects, which can be used for accurate performance evaluation. Experimental results show that C-Drag achieves promising performance across multiple metrics, excelling in object motion control. Our benchmark, code, and models will be available at https://github.com/WesLee88524/C-Drag-Official-Repo.
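
The data flow described in the abstract (object perception, then motion reasoning, then trajectory-conditioned generation) can be sketched as three pluggable steps. This is a minimal illustration rather than the authors' implementation: `DetectedObject`, `run_pipeline`, and the three `fake_*` stand-ins are hypothetical names, and the stub reasoning step simply keeps undragged objects in place instead of querying a model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]
Trajectory = List[Point]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass(frozen=True)
class DetectedObject:
    """Position and category information for one object (hypothetical schema)."""
    name: str
    category: str
    box: Box


def run_pipeline(
    image: object,
    drags: Dict[str, Trajectory],
    perceive: Callable[[object], List[DetectedObject]],
    reason: Callable[[List[DetectedObject], Dict[str, Trajectory]], Dict[str, Trajectory]],
    generate: Callable[[object, Dict[str, Trajectory]], object],
) -> object:
    """Chain perception -> motion reasoning -> trajectory-based generation."""
    objects = perceive(image)              # positions and categories
    trajectories = reason(objects, drags)  # one trajectory per affected object
    return generate(image, trajectories)   # stands in for a pre-trained generator


# Stand-ins so the sketch runs; none of these reflect real models.
def fake_perceive(image: object) -> List[DetectedObject]:
    return [
        DetectedObject("ball_a", "ball", (10, 10, 20, 20)),
        DetectedObject("ball_b", "ball", (40, 10, 50, 20)),
    ]


def fake_reason(objects, drags):
    # Assumption: dragged objects keep the user trajectory; the others
    # stay at their box centre. A real module would infer their motion.
    result = {}
    for obj in objects:
        if obj.name in drags:
            result[obj.name] = list(drags[obj.name])
        else:
            centre = ((obj.box[0] + obj.box[2]) / 2, (obj.box[1] + obj.box[3]) / 2)
            result[obj.name] = [centre]
    return result


def fake_generate(image, trajectories):
    return {"image": image, "trajectories": trajectories}


result = run_pipeline(
    "frame.png",
    {"ball_a": [(15.0, 15.0), (30.0, 15.0)]},
    fake_perceive,
    fake_reason,
    fake_generate,
)
```

Passing the three steps in as callables mirrors the training-free framing in the TL;DR: each stage can be swapped for a pre-trained component without changing the surrounding code.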

Paper Structure

This paper contains 13 sections, 3 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Our C-Drag employs a single trajectory control signal (red arrow), integrated with a vision-language model (VLM) and Chain-of-Thought (CoT) reasoning, to generate controllable videos that emphasize motion realism. Results are illustrated in three example scenarios, each comprising two rows: baseline output (top) and C-Drag output (bottom). (a) Collision and Chain Reaction: The trajectory of a single sphere leads to complex collisions and chain reactions among multiple spheres. (b) Gravity and Force: A foot's trajectory impacts a football, showing motion under gravitational and force dynamics. (c) Levers and Mirrors: A puppy's movement is reflected in a mirror, showcasing coupled motion control through mirror reflection. Best viewed zoomed in. Additional results are presented in the supplementary material.
  • Figure 2: Our C-Drag approach is motivated by human cognitive patterns and models dynamic interactions between objects for controllable video generation. Human reasoning about object interactions involves a few key stages: first, obtaining information about the image and objects; next, inferring relationships between objects; then, based on the trajectory of a specific object and motion principles, predicting the reactions of other objects; and finally, determining the overall result of these interactions.
  • Figure 3: Overview of our C-Drag. C-Drag takes a single RGB image and one or more drag motion trajectories as input. An object perception module obtains information about all related objects in the image. The Chain-of-Thought (CoT)-based reasoning module then reasons about the motion trajectories of all objects according to the detected position and category information. Given the generated object trajectories, a pre-trained trajectory-based generation model produces videos with multiple-object interactions.
  • Figure 4: An illustrative view of the CoT-based Motion Reasoning Module, which follows a five-stage reasoning process. In Scene and Object Understanding, a pre-trained visual language model (VLM) interprets the scene and establishes motion rules using formatted information from the Object Perception Module. In Reasoning Object Relationship, the VLM identifies spatial relationships and potential interactions among objects to inform trajectory predictions. Interaction Trajectories Reasoning follows, categorizing interactions (e.g., collisions, forces) and predicting the paths of affected objects. During Iterative Reasoning and Ranking, initial predictions are iteratively optimized, with the VLM selecting the most consistent motion sequences. Finally, in Validation and Final Reasoning Outcome, forward and backward validation checks that the predicted trajectories align with the scene rules, iterating until they do.
  • Figure 5: Qualitative comparison of our C-Drag with existing methods. PhysGen [liu2024physgen] (Rows 1, 5, 9) struggles with deformable objects and non-planar scenarios, requiring extensive manual parameter tuning, which leads to unrealistic movements in complex scenes. For example, in Row 5, the character falls since the seesaw boundary is not set, causing incorrect interactions. Similarly, both DragAnything [wu2024draganything] (Rows 2, 6, 10) and DragNUWA [yin2023dragnuwa] (Rows 3, 7, 11) have issues when uncontrolled objects lose temporal consistency, such as merged birds in Rows 2 and 3, and severely deformed mirror-reflected objects in Rows 10 and 11. In contrast, our C-Drag not only perceives and infers the movements of all objects but also maintains temporal consistency for all elements. Additional results are presented in the supplementary material.
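
The five stages named in the Figure 4 caption can be read as a small control loop: three single-pass stages, followed by a refine-and-validate cycle that repeats until the prediction is accepted. The sketch below illustrates only that control flow, under stated assumptions: `reason_stagewise`, the `ask` callable standing in for a VLM query, the `max_rounds` budget, and the convention that a verdict beginning with "valid" means acceptance are all hypothetical and do not come from the paper.

```python
from typing import Callable, Dict, List

# Stage names as given in the Figure 4 caption.
STAGES = [
    "Scene and Object Understanding",
    "Reasoning Object Relationship",
    "Interaction Trajectories Reasoning",
    "Iterative Reasoning and Ranking",
    "Validation and Final Reasoning Outcome",
]


def reason_stagewise(
    ask: Callable[[str, str], str],  # hypothetical: (stage, context) -> answer
    scene_info: str,
    max_rounds: int = 3,             # assumed iteration budget
) -> Dict[str, object]:
    """Run stages 1-3 once, then repeat stages 4-5 until validation passes."""
    context = scene_info
    log: List[str] = []

    # Stages 1-3: each answer is appended to the context of the next stage.
    for stage in STAGES[:3]:
        context += "\n" + ask(stage, context)
        log.append(stage)

    best = ""
    accepted = False
    for _ in range(max_rounds):
        best = ask(STAGES[3], context)                    # refine and rank
        log.append(STAGES[3])
        verdict = ask(STAGES[4], context + "\n" + best)   # validate
        log.append(STAGES[4])
        if verdict.strip().lower().startswith("valid"):
            accepted = True
            break
        context += "\nRejected: " + best                  # feed the failure back
    return {"trajectories": best, "accepted": accepted, "log": log}


def make_fake_vlm() -> Callable[[str, str], str]:
    """Deterministic stand-in: rejects the first candidate, accepts the second."""
    validations = {"count": 0}

    def ask(stage: str, context: str) -> str:
        if stage == STAGES[4]:
            validations["count"] += 1
            if validations["count"] >= 2:
                return "valid"
            return "invalid: ball_b moves before contact"
        if stage == STAGES[3]:
            return "candidate after %d rejection(s)" % context.count("Rejected:")
        return "notes for " + stage

    return ask


outcome = reason_stagewise(make_fake_vlm(), "ball_a dragged right toward ball_b")
```

Capping the loop with `max_rounds` is a design choice of this sketch: the caption says validation iterates until the trajectories are accepted, and a budget keeps the loop from running indefinitely if they never are.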