C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation
Yuhao Li, Mirana Claire Angel, Salman Khan, Yu Zhu, Jinqiu Sun, Yanning Zhang, Fahad Shahbaz Khan
TL;DR
C-Drag targets controllable video generation with realistic multi-object interactions. Given a single image and a drag trajectory, it combines a perception-first pipeline with Chain-of-Thought (CoT) motion reasoning to produce interaction-aware trajectories, which are then fed to a pre-trained trajectory diffusion generator. The method is training-free: a Vision-Language Model (VLM) and an object perception module identify the objects, and a five-stage CoT reasoning process then infers how they interact (e.g., collisions, gravity, mirrors) and predicts trajectories for all moving objects. The paper also introduces the VOI benchmark (72 videos across three interaction types, with ground-truth trajectories) and the MOC metric, which quantifies motion consistency across all objects. In experiments on VOI, C-Drag outperforms prior trajectory-based methods in FVD, FID, and MOC. Overall, C-Drag embeds structured, stage-wise reasoning about object interactions into trajectory-based video synthesis, yielding more realistic multi-object dynamics and a practical benchmark for evaluation.
Abstract
Trajectory-based motion control has emerged as an intuitive and efficient approach for controllable video generation. However, existing trajectory-based approaches usually generate only the motion trajectory of the controlled object and ignore the dynamic interactions between the controlled object and its surroundings. To address this limitation, we propose a Chain-of-Thought-based motion controller for controllable video generation, named C-Drag. Instead of directly generating the motion of some objects, C-Drag first performs object perception and then reasons about the dynamic interactions between different objects according to the given motion control of the objects. Specifically, our method includes an object perception module and a Chain-of-Thought-based motion reasoning module. The object perception module employs vision-language models to capture the position and category information of the objects within the image. The Chain-of-Thought-based motion reasoning module takes this information as input and conducts a stage-wise reasoning process to generate motion trajectories for each of the affected objects, which are subsequently fed to the diffusion model for video synthesis. Furthermore, we introduce a new video object interaction (VOI) dataset to evaluate the generation quality of motion-controlled video generation methods. Our VOI dataset contains three typical types of interactions and provides the motion trajectories of objects, which can be used for accurate performance evaluation. Experimental results show that C-Drag achieves promising performance across multiple metrics, excelling in object motion control. Our benchmark, code, and models will be available at https://github.com/WesLee88524/C-Drag-Official-Repo.
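The data flow described in the abstract (perception, then motion reasoning, then trajectory-conditioned synthesis) can be sketched as three pluggable steps. This is only an illustrative sketch, not the released C-Drag code: the names `DetectedObject`, `run_pipeline`, and the `toy_*` functions are hypothetical, and the toy stand-ins replace the VLM, the Chain-of-Thought reasoning, and the diffusion model with trivial placeholders so that the example runs on its own.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Point = Tuple[float, float]
Trajectory = List[Point]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass(frozen=True)
class DetectedObject:
    """Output of the perception step (hypothetical structure)."""
    name: str       # illustrative identifier, e.g. "ball_0"
    category: str   # category label of the object
    box: Box        # position in image coordinates


def run_pipeline(
    image: Any,
    drag: Trajectory,
    perceive: Callable[[Any], List[DetectedObject]],
    reason: Callable[[List[DetectedObject], Trajectory], Dict[str, Trajectory]],
    synthesize: Callable[[Any, Dict[str, Trajectory]], Any],
) -> Any:
    """Chain the three steps: perception -> motion reasoning -> synthesis."""
    objects = perceive(image)               # positions and categories
    trajectories = reason(objects, drag)    # one trajectory per object
    return synthesize(image, trajectories)  # trajectory-conditioned output


# Toy stand-ins (assumptions for illustration only).
def toy_perceive(image: Any) -> List[DetectedObject]:
    # A real system would query a vision-language model here.
    return [
        DetectedObject("ball_0", "ball", (0, 0, 10, 10)),
        DetectedObject("ball_1", "ball", (40, 0, 50, 10)),
    ]


def toy_reason(objects: List[DetectedObject], drag: Trajectory) -> Dict[str, Trajectory]:
    # Placeholder logic: the object under the drag start follows the drag;
    # every other object stays at its box center. No interaction is modeled.
    start_x, start_y = drag[0]
    result: Dict[str, Trajectory] = {}
    for obj in objects:
        x0, y0, x1, y1 = obj.box
        if x0 <= start_x <= x1 and y0 <= start_y <= y1:
            result[obj.name] = list(drag)
        else:
            center = ((x0 + x1) / 2, (y0 + y1) / 2)
            result[obj.name] = [center] * len(drag)
    return result


def toy_synthesize(image: Any, trajectories: Dict[str, Trajectory]) -> Dict[str, Trajectory]:
    # A real generator would return video frames; this returns its input.
    return trajectories
```

Passing the steps in as callables keeps the sketch independent of any particular VLM or diffusion backbone. For example, `run_pipeline(None, [(5.0, 5.0), (20.0, 5.0)], toy_perceive, toy_reason, toy_synthesize)` returns one trajectory per detected object, keyed by the illustrative object name.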
