MotionCanvas: Cinematic Shot Design with Controllable Image-to-Video Generation

Jinbo Xing, Long Mai, Cusuh Ham, Jiahui Huang, Aniruddha Mahapatra, Chi-Wing Fu, Tien-Tsin Wong, Feng Liu

TL;DR

MotionCanvas tackles the problem of intuitive, scene-aware shot design for image-to-video generation by introducing a three-part pipeline: a Motion Design Module that captures 3D-structured user intents, a Motion Signal Translation Module that converts scene-space motions into robust 2D screen-space conditioning, and a DiT-based motion-conditioned video generation model that fuses DCT-based trajectories, scene-space bounding-box cues, and text prompts. The key innovation is converting 3D motion intent into reliable 2D conditioning without requiring expensive 3D annotations, enabling joint camera and object control through depth-aware, scene-anchored representations. Extensive experiments demonstrate strong performance in camera motion accuracy, 3D-aware object motion control, and joint motion fidelity, with ablations supporting the effectiveness of the proposed motion representations and conditioning strategies. The work has practical impact for cinematic shot design and editing, expanding the creative toolkit for image-to-video synthesis while highlighting avenues for efficiency and explicit prompt-motion harmonization in future work.
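The "DCT-based trajectories" mentioned above can be illustrated with a minimal sketch. It assumes a point trajectory is a sequence of 2D screen positions and that it is compressed by keeping only the lowest-frequency coefficients of an orthonormal DCT-II. The function names (`dct_matrix`, `encode_trajectory`, `decode_trajectory`) and the number of retained coefficients are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis as an (n, n) matrix; row k is frequency k."""
    k = np.arange(n)[:, None]  # frequency index
    t = np.arange(n)[None, :]  # time (frame) index
    basis = np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    basis[0] *= np.sqrt(1.0 / n)
    basis[1:] *= np.sqrt(2.0 / n)
    return basis


def encode_trajectory(points, num_coeffs: int) -> np.ndarray:
    """Map a (T, 2) trajectory to its first `num_coeffs` DCT coefficients per axis."""
    points = np.asarray(points, dtype=float)
    return dct_matrix(len(points))[:num_coeffs] @ points


def decode_trajectory(coeffs: np.ndarray, num_frames: int) -> np.ndarray:
    """Reconstruct a (num_frames, 2) trajectory from truncated DCT coefficients."""
    basis = dct_matrix(num_frames)[: len(coeffs)]
    return basis.T @ coeffs


if __name__ == "__main__":
    # Hypothetical smooth trajectory over 16 frames (pixel coordinates).
    t = np.linspace(0.0, 1.0, 16)
    traj = np.stack([100 + 50 * t, 80 + 20 * np.sin(np.pi * t)], axis=1)
    coeffs = encode_trajectory(traj, num_coeffs=8)
    approx = decode_trajectory(coeffs, num_frames=16)
    print("coefficients shape:", coeffs.shape)
    print("max abs error:", np.abs(approx - traj).max())
```

Because the basis is orthonormal, keeping all coefficients reproduces the trajectory exactly, and dropping high frequencies yields a smoothed approximation with a fixed-size representation.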

Abstract

This paper presents a method that allows users to design cinematic video shots in the context of image-to-video generation. Shot design, a critical aspect of filmmaking, involves meticulously planning both camera movements and object motions in a scene. However, enabling intuitive shot design in modern image-to-video generation systems presents two main challenges: first, effectively capturing user intentions on the motion design, where both camera movements and scene-space object motions must be specified jointly; and second, representing motion information that can be effectively utilized by a video diffusion model to synthesize the image animations. To address these challenges, we introduce MotionCanvas, a method that integrates user-driven controls into image-to-video (I2V) generation models, allowing users to control both object and camera motions in a scene-aware manner. By connecting insights from classical computer graphics and contemporary video generation techniques, we demonstrate the ability to achieve 3D-aware motion control in I2V synthesis without requiring costly 3D-related training data. MotionCanvas enables users to intuitively depict scene-space motion intentions, and translates them into spatiotemporal motion-conditioning signals for video diffusion models. We demonstrate the effectiveness of our method on a wide range of real-world image content and shot-design scenarios, highlighting its potential to enhance the creative workflows in digital content creation and adapt to various image and video editing applications.
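To make the idea of translating scene-space motion into screen-space signals more concrete, the sketch below lifts a pixel to 3D using a depth value and reprojects it under a sequence of camera poses. It assumes a standard pinhole camera with known intrinsics `K` and world-to-camera poses `(R, t)`. These names, the helper functions, and the pinhole model itself are assumptions for illustration and are not taken from the paper.

```python
import numpy as np


def unproject(pixel, depth: float, K: np.ndarray) -> np.ndarray:
    """Lift pixel (u, v) with a depth value to a 3D point in the first camera's frame."""
    u, v = pixel
    return depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))


def project(point: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Project a 3D point in camera coordinates to pixel coordinates."""
    uvw = K @ point
    return uvw[:2] / uvw[2]


def track_under_camera_motion(pixel, depth: float, K: np.ndarray, poses) -> np.ndarray:
    """Return the (num_frames, 2) screen-space track of a static scene point.

    `poses` is a list of (R, t) world-to-camera transforms, where the world
    frame is taken to be the first camera's frame (an assumption of this sketch).
    """
    p_world = unproject(pixel, depth, K)
    return np.array([project(R @ p_world + t, K) for R, t in poses])


if __name__ == "__main__":
    # Hypothetical intrinsics: focal length 500 px, principal point (320, 240).
    K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
    # Camera translates right by 0.1 units per frame, so points shift by -0.1 * i.
    poses = [(np.eye(3), np.array([-0.1 * i, 0.0, 0.0])) for i in range(3)]
    print(track_under_camera_motion((320.0, 240.0), 2.0, K, poses))
```

With these values, a static point on the optical axis at depth 2 moves left by 500 × 0.1 / 2 = 25 pixels per frame, which shows how a 3D camera move becomes a 2D point track without any 3D training labels.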

Paper Structure

This paper contains 28 sections, 3 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: MotionCanvas offers comprehensive motion controls to animate a static image (the "Inputs" column) with various types of camera movements and object motions. Note the different camera movements across columns and object motions across rows.
  • Figure 2: Overview of MotionCanvas. Given an input image and high-level scene-space motion intent, MotionCanvas decomposes the motion (camera and object motion with their timing) and translates it into screen space using depth-based synthesis and hierarchical transformation in the Motion Signal Translation module. These screen-space motion signals are then passed to a video generation model to produce the final cinematic shots.
  • Figure 3: Illustration of our motion-conditioned video generation model. The input image and bbox color frames are tokenized via a 3D-VAE encoder and then summed. The resultant tokens are concatenated with other conditional tokens, and fed into the DiT-based video generation model.
  • Figure 4: Shot design generated by our MotionCanvas under various types of joint camera and object motion controls.
  • Figure 5: Long videos generated by MotionCanvas. All cases share the same complex sequence of camera motions but use different object motion controls.
  • ...and 11 more figures
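
The token flow described in the Figure 3 caption (image and bbox tokens summed, then concatenated with other conditional tokens) can be sketched at the shape level as follows. The 3D-VAE encoder and the DiT are not modeled here: the token arrays are random placeholders, and the function name `fuse_tokens`, the token counts, and the channel width are hypothetical.

```python
import numpy as np


def fuse_tokens(image_tokens, bbox_tokens, extra_condition_tokens):
    """Sum same-shaped image and bbox tokens, then append other condition tokens.

    image_tokens, bbox_tokens: (N, D) arrays, standing in for encoder outputs.
    extra_condition_tokens: list of (M_i, D) arrays (e.g. trajectory or text tokens).
    Returns an (N + sum(M_i), D) sequence for a transformer-style model.
    """
    if image_tokens.shape != bbox_tokens.shape:
        raise ValueError("image and bbox tokens must share a shape to be summed")
    visual = image_tokens + bbox_tokens  # element-wise sum, as in the caption
    return np.concatenate([visual, *extra_condition_tokens], axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.normal(size=(12, 8))      # placeholder for encoded image tokens
    bbox = rng.normal(size=(12, 8))       # placeholder for encoded bbox color frames
    trajectory = rng.normal(size=(4, 8))  # placeholder trajectory tokens
    text = rng.normal(size=(6, 8))        # placeholder text tokens
    sequence = fuse_tokens(image, bbox, [trajectory, text])
    print(sequence.shape)
```

Summation keeps the visual sequence length unchanged while injecting the bbox signal at matching positions, whereas concatenation lengthens the sequence with the remaining conditions.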