MotionCanvas: Cinematic Shot Design with Controllable Image-to-Video Generation
Jinbo Xing, Long Mai, Cusuh Ham, Jiahui Huang, Aniruddha Mahapatra, Chi-Wing Fu, Tien-Tsin Wong, Feng Liu
TL;DR
MotionCanvas tackles the problem of intuitive, scene-aware shot design for image-to-video generation by introducing a three-part pipeline: a Motion Design Module that captures 3D-structured user intents, a Motion Signal Translation Module that converts scene-space motions into robust 2D screen-space conditioning, and a DiT-based motion-conditioned video generation model that fuses DCT-based trajectories, scene-space bounding-box cues, and text prompts. The key innovation is converting 3D motion intent into reliable 2D conditioning without requiring expensive 3D annotations, enabling joint camera and object control through depth-aware, scene-anchored representations. Extensive experiments demonstrate strong performance in camera motion accuracy, 3D-aware object motion control, and joint motion fidelity, with ablations supporting the effectiveness of the proposed motion representations and conditioning strategies. The work has practical impact for cinematic shot design and editing, expanding the creative toolkit for image-to-video synthesis while highlighting avenues for efficiency and explicit prompt-motion harmonization in future work.
Abstract
This paper presents a method that allows users to design cinematic video shots in the context of image-to-video generation. Shot design, a critical aspect of filmmaking, involves meticulously planning both camera movements and object motions in a scene. However, enabling intuitive shot design in modern image-to-video generation systems presents two main challenges: first, effectively capturing user intent for the motion design, where camera movements and scene-space object motions must be specified jointly; and second, representing the motion information in a form that a video diffusion model can use effectively to synthesize the animated image. To address these challenges, we introduce MotionCanvas, a method that integrates user-driven controls into image-to-video (I2V) generation models, allowing users to control both object and camera motions in a scene-aware manner. By connecting insights from classical computer graphics and contemporary video generation techniques, we demonstrate the ability to achieve 3D-aware motion control in I2V synthesis without requiring costly 3D-related training data. MotionCanvas enables users to intuitively depict scene-space motion intentions and translates them into spatiotemporal motion-conditioning signals for video diffusion models. We demonstrate the effectiveness of our method on a wide range of real-world image content and shot-design scenarios, highlighting its potential to enhance creative workflows in digital content creation and to adapt to various image and video editing applications.
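The TL;DR mentions DCT-based trajectories as one of the conditioning signals. As a rough illustration of that general idea only, the sketch below compresses a 2D screen-space point trajectory into a few low-frequency DCT coefficients per axis and reconstructs it. The function names (`dct2`, `idct2`, `encode_trajectory`, `decode_trajectory`), the orthonormal DCT-II convention, and the number of retained coefficients are all assumptions made for this example; none of them is taken from the paper, and this is not the authors' implementation.

```python
import math


def dct2(signal):
    """Orthonormal DCT-II of a 1D sequence (hypothetical helper)."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(signal))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * total)
    return coeffs


def idct2(coeffs, n):
    """Inverse of dct2 for n samples; coefficients beyond len(coeffs) count as zero."""
    samples = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        samples.append(total)
    return samples


def encode_trajectory(points, num_coeffs):
    """Keep the first num_coeffs DCT coefficients of the x and y tracks."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return dct2(xs)[:num_coeffs], dct2(ys)[:num_coeffs]


def decode_trajectory(coeffs_x, coeffs_y, num_frames):
    """Reconstruct a num_frames-long (x, y) trajectory from truncated coefficients."""
    return list(zip(idct2(coeffs_x, num_frames), idct2(coeffs_y, num_frames)))


if __name__ == "__main__":
    # Illustrative 16-frame trajectory in pixel coordinates (made-up data).
    frames = 16
    track = [(10.0 + 3.0 * t, 50.0 + 20.0 * math.sin(t / 5.0)) for t in range(frames)]
    cx, cy = encode_trajectory(track, num_coeffs=6)
    rebuilt = decode_trajectory(cx, cy, frames)
    max_err = max(math.hypot(a[0] - b[0], a[1] - b[1]) for a, b in zip(track, rebuilt))
    print(f"kept {len(cx)} of {frames} coefficients per axis, max error {max_err:.3f} px")
```

Keeping only the low-frequency coefficients gives a fixed-length, smooth description of a trajectory regardless of small per-frame jitter, which is one plausible reason to prefer such a representation for conditioning. How the coefficients are actually fed to the video model is not specified in this section.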
