Tri-Prompting: Video Diffusion with Unified Control over Scene, Subject, and Motion

Zhenghong Zhou, Xiaohang Zhan, Zhiqin Chen, Soo Ye Kim, Nanxuan Zhao, Haitian Zheng, Qing Liu, He Zhang, Zhe Lin, Yuqian Zhou, Jiebo Luo

Abstract

Recent video diffusion models have made remarkable strides in visual quality, yet precise, fine-grained control remains a key bottleneck that limits practical customizability for content creation. For AI video creators, three forms of control are crucial: (i) scene composition, (ii) multi-view consistent subject customization, and (iii) camera-pose or object-motion adjustment. Existing methods typically handle these dimensions in isolation, with limited support for multi-view subject synthesis and identity preservation under arbitrary pose changes. The lack of a unified architecture makes it difficult to support versatile, jointly controllable video generation. We introduce Tri-Prompting, a unified framework and two-stage training paradigm that integrates scene composition, multi-view subject consistency, and motion control. Our approach uses a dual-condition motion module driven by 3D tracking points for background scenes and downsampled RGB cues for foreground subjects. To balance controllability and visual realism, we further propose an inference-time ControlNet scale schedule. Tri-Prompting supports novel workflows, including 3D-aware subject insertion into any scene and manipulation of existing subjects in an image. Experimental results demonstrate that Tri-Prompting significantly outperforms specialized baselines such as Phantom and DaS in multi-view subject identity, 3D consistency, and motion accuracy.
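
The abstract mentions an inference ControlNet scale schedule but does not give its form. As a rough illustration of what such a schedule could look like, the sketch below assumes a hold-then-linear-decay shape: the control signal is applied at full strength during early denoising steps and relaxed afterwards. The function name `controlnet_scale_schedule` and the parameters `start_scale`, `end_scale`, and `hold_fraction` are hypothetical and are not taken from the paper.

```python
def controlnet_scale_schedule(num_steps, start_scale=1.0, end_scale=0.0,
                              hold_fraction=0.5):
    """Return one ControlNet scale per denoising step.

    ASSUMPTION: the hold-then-linear-decay shape is illustrative only;
    the paper's actual schedule is not described in the abstract.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    if not 0.0 <= hold_fraction <= 1.0:
        raise ValueError("hold_fraction must lie in [0, 1]")

    # Number of early steps that keep the full control strength.
    hold_steps = int(num_steps * hold_fraction)
    decay_steps = num_steps - hold_steps

    scales = [start_scale] * hold_steps
    for i in range(decay_steps):
        # t moves from just above 0 to exactly 1 over the decay phase.
        t = (i + 1) / decay_steps
        scales.append(start_scale + t * (end_scale - start_scale))
    return scales


if __name__ == "__main__":
    # Print the per-step scale for a 10-step sampler.
    for step, scale in enumerate(controlnet_scale_schedule(10)):
        print(step, round(scale, 3))
```

In a sampler loop, the value for each step would multiply the ControlNet residuals before they are added to the backbone features. The intuition assumed here is that strong early control fixes layout and motion, while weaker late control leaves the base model free to refine appearance.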

Paper Structure (21 sections, 3 equations, 11 figures, 3 tables)


Figures (11)

  • Figure 1: The proposed Tri-Prompting framework unifies scene control, multi-view subject control, and scene–subject motion control within a single video diffusion model. Users can select a subject, insert/manipulate it in any scene, and control both the camera pose and the character motion in a natural and physically aware manner using the keyboard, while maintaining appearance consistency that matches the provided reference images. The figure illustrates cases where users control only the scene/camera motion, only the subject motion, and the joint scene–subject motion, respectively. Yellow arrows indicate the camera trajectory, and red arrows indicate subject motion.
  • Figure 2: Overview of the proposed Tri-Prompting. Tri-Prompting unifies video diffusion with first-frame/multi-view images and a dual-conditioning motion anchor to jointly control the scene (where), subject (who), and motion (how). We employ a two-stage training paradigm: first optimizing LoRA for scene and subject control, then finetuning ControlNet for motion control. The framework preserves multi-view identity while achieving disentangled control between the foreground and background.
  • Figure 3: Comparison with DaS and Phantom. (a) Motion control: Unlike DaS, which hallucinates content as tracking points disappear, Tri-Prompting maintains robust alignment under extreme motion. (b) Multi-view identity: Tri-Prompting eliminates Phantom's structural distortions (e.g., backward-facing astronaut and teddy bear) by preserving multi-view identity and 3D consistency. Our multi-view fusion resolves coarse motion proxies into detailed, 3D-consistent subjects.
  • Figure 4: Applications. Tri-Prompting supports both insertion and manipulation of 3D subjects and joint control. Under camera or subject pose changes, the interactions between the subject and background remain natural and exhibit plausible non-rigid motion (e.g., walking). The multi-view identity is also preserved, providing strong 3D consistency. Please see more video results in the appendix.
  • Figure 5: Effect of multi-view subject images. Both multi-view ID and 3D consistency improve with more views. This demonstrates that a single-view image is insufficient for maintaining subject integrity and identity preservation during video generation.
  • ...and 6 more figures