I2VControl: Disentangled and Unified Video Motion Synthesis Control

Wanquan Feng, Tianhao Qi, Jiawei Liu, Mingzhen Sun, Pengqi Tu, Tianxiang Ma, Fei Dai, Songtao Zhao, Siyu Zhou, Qian He

TL;DR

This work tackles the challenge of controllable video synthesis by addressing the fragmentation of motion controls across camera movement, object dragging, and motion brush. It proposes I2VControl, a unified framework that represents all controls as dense point trajectories, partitions the input into motion units, and employs an adapter network to integrate with pretrained diffusion models. Key contributions include consistent representation, spatial partitioning, a dual-task data pipeline, and training an adapter that enables conflict-free multi-type control, achieving strong performance across tasks and enabling creative combinations. The approach is shown to generalize across different base models, highlighting practical impact for flexible and user-driven video production.

Abstract

Motion controllability is crucial in video synthesis. However, most previous methods are limited to single control types, and combining them often results in logical conflicts. In this paper, we propose a disentangled and unified framework, namely I2VControl, to overcome the logical conflicts. We rethink camera control, object dragging, and motion brush, reformulating all tasks into a consistent representation based on point trajectories, each managed by a dedicated formulation. Accordingly, we propose a spatial partitioning strategy, where each unit is assigned to a concomitant control category, enabling diverse control types to be dynamically orchestrated within a single synthesis pipeline without conflicts. Furthermore, we design an adapter structure that functions as a plug-in for pre-trained models and is agnostic to specific model architectures. We conduct extensive experiments, achieving excellent performance on various control tasks, and our method further facilitates user-driven creative combinations, enhancing innovation and creativity. Project page: https://wanquanf.github.io/I2VControl .
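To make the idea of a shared point-trajectory representation with spatial partitioning more concrete, here is a minimal sketch. It is a toy illustration under stated assumptions, not the authors' implementation: the function name `dense_trajectories`, the array layout `(T, H, W, 2)`, and the simplification that each unit moves by one rigid per-frame offset are all hypothetical choices made for this example.

```python
import numpy as np

def dense_trajectories(unit_map, unit_offsets, num_frames):
    """Toy dense trajectory field of shape (T, H, W, 2), stored as (x, y).

    unit_map:     (H, W) int array giving a unit id per pixel (assumed layout).
    unit_offsets: dict mapping unit id -> (T, 2) array of per-frame (dx, dy)
                  offsets relative to the first frame (a simplifying assumption;
                  real control signals need not be rigid translations).
    """
    h, w = unit_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    start = np.stack([xs, ys], axis=-1).astype(np.float32)  # (H, W, 2)

    # Every pixel starts as a static trajectory: the same position in all frames.
    traj = np.broadcast_to(start, (num_frames, h, w, 2)).copy()

    # Each unit only edits the pixels it owns, so controls cannot overlap.
    for unit_id, offsets in unit_offsets.items():
        mask = unit_map == unit_id
        traj[:, mask] += np.asarray(offsets, dtype=np.float32)[:, None, :]
    return traj


# Usage: a 2x2 image where unit 1 (right column) drifts right by one pixel per frame.
unit_map = np.array([[0, 1],
                     [0, 1]])
offsets = {1: np.array([[0, 0], [1, 0], [2, 0]])}
traj = dense_trajectories(unit_map, offsets, num_frames=3)
```

Because every pixel belongs to exactly one unit in this sketch, two controls never write to the same trajectory, which is the sense in which a partition avoids conflicts.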

Paper Structure

This paper contains 21 sections, 12 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: We propose an all-in-one disentangled and unified framework for image-to-video motion synthesis control, named I2VControl. The illustration shows several control scenarios: camera movement (the camera dollies in and gets closer to the sculpture), motion brush (smoke flows in the wind, with a given motion strength value), object movement (the astronaut walks forward), and a combination of all the above control types (the camera tilts down; the apple is dragged; the squirrel is brushed). Users can select control modes according to their requirements, and the modes can be combined without conflict.
  • Figure 2: Examples of control conflicts when multiple control tasks are combined. In (a), we try to drag the cat and apply a camera tilt-up at the same time. We drag the cat in the "right + down" direction; however, in the tilt-up view, the target position becomes much higher than the cat, which makes the drag instruction ambiguous. In (b), we try to use motion brush and camera tilt-down together. We brush the human region, keeping the background fixed; however, in the tilt-down view, every background pixel moves, which is also a conflict.
  • Figure 3: Examples of dual tasks, where the dual perception task serves as the data pipeline of the controlling generation task.
  • Figure 4: Our data pipeline to convert RGB video into control signals and conduct the training. For more details, please see Sec. \ref{sec:our_dual}.
  • Figure 5: The process of obtaining the motion units.
  • ...and 6 more figures