InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions

Yiyuan Zhang, Yuhao Kang, Zhixin Zhang, Xiaohan Ding, Sanyuan Zhao, Xiangyu Yue

TL;DR

InteractiveVideo tackles the challenge of aligning video generation with nuanced human intent by moving beyond static image/text conditioning to multimodal, interactive guidance. It introduces a training-free Synergistic Multimodal Instruction mechanism that injects user edits as denoising residuals within a two-pipeline diffusion framework (T2I and I2V), enabling precise control over content, semantics, and motion. Empirical results on personalization, editing precision, and motion control show improvements over Gen-2, I2VGen-XL, and Pika Labs, with strong user satisfaction and efficient inference on commodity GPUs. The approach broadens practical video creation workflows and has implications for education, entertainment, and AR/VR, while maintaining responsible AI practices.

Abstract

We introduce InteractiveVideo, a user-centric framework for video generation. Unlike traditional generative approaches that operate on user-provided images or text, our framework is designed for dynamic interaction, allowing users to instruct the generative model through various intuitive mechanisms throughout the generation process, such as text and image prompts, painting, and drag-and-drop. We propose a Synergistic Multimodal Instruction mechanism, designed to seamlessly integrate users' multimodal instructions into generative models, thus facilitating a cooperative and responsive interaction between user inputs and the generative process. This approach enables iterative and fine-grained refinement of the generation result through precise and effective user instructions. With InteractiveVideo, users have the flexibility to meticulously tailor key aspects of a video. They can paint the reference image, edit semantics, and adjust video motions until their requirements are fully met. Code, models, and demo are available at https://github.com/invictus717/InteractiveVideo

Paper Structure (18 sections, 5 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: Interactive Video Generation. We propose a user-centric framework that effectively synergizes users' multimodal instructions. Users can easily edit key components in the video generation process, leading to high-quality video and increased user satisfaction.
  • Figure 2: Comparison between Gen-2 and InteractiveVideo. For each case, the first row shows the results generated by Gen-2, and the second row shows ours. (More comparisons with Pika Labs, I2VGen-XL (Zhang et al., 2023), and Gen-2 can be found in Appendix Figures 6, 7, and 8.)
  • Figure 3: Framework Illustration. In InteractiveVideo, users can utilize multimodal instructions to interact with generative models on video content, motion, and trajectory.
  • Figure 4: Video Content Manipulation with InteractiveVideo. In (a), (b), and (c), we manipulate content by adding birds, waves, and polar lights, respectively. The added objects are then animated throughout the video. These results show the flexibility of our framework for video content creation.
  • Figure 5: Fine-grained Video Editing with InteractiveVideo. In (a), (b), and (c), we perform fine-grained regional semantic editing, changing the colors and appearances of specific objects. These results show the outstanding controllability of our framework for video generation.
  • ...and 4 more figures