Controllable Longer Image Animation with Diffusion Models

Qiang Wang, Minghua Liu, Junjun Hu, Fan Jiang, Mu Xu

TL;DR

The paper tackles the challenge of generating realistic, long-duration videos from a single image with precise, open-domain control over motion. It introduces diffusion-model-based image animation guided by motion priors: object-level motion fields derived from sparse trajectories and a global motion strength embedding, plus a refinement module for dense flow. A key innovation is longer-video generation via phased inference and shared-noise reschedule, which preserves content contours while varying motion across segments. Empirical results show state-of-the-art performance against baselines on both automatic metrics (FVD, PSNR, SSIM, LPIPS, Tem-Cons) and human judgments, with significant gains in temporal coherence for extended sequences. The approach enables robust, controllable animation beyond texture-focused domains, with potential for practical applications in film, advertising, and interactive media, and points to future work on richer multi-condition controls such as sketch or depth cues.

Abstract

Generating realistic animated videos from static images is an important area of research in computer vision. Methods based on physical simulation and motion prediction have achieved notable advances, but they are often limited to specific object textures and motion trajectories and fail to capture highly complex environments and physical dynamics. In this paper, we introduce an open-domain controllable image animation method using motion priors with video diffusion models. Our method achieves precise control over the direction and speed of motion in the movable region by extracting motion field information from videos and learning motion trajectories and strengths. Current pretrained video generation models are typically limited to very short videos, usually fewer than 30 frames. In contrast, we propose an efficient long-duration video generation method based on noise reschedule, tailored specifically to image animation, which enables videos of over 100 frames while maintaining consistency in scene content and motion coordination. Specifically, we decompose the denoising process into two distinct phases: shaping scene contours and refining motion details. We then reschedule the noise so that the generated frame sequences maintain long-distance noise correlation. We conducted extensive experiments against 10 baselines, covering both commercial tools and academic methods, which demonstrate the superiority of our method. Our project page: https://wangqiang9.github.io/Controllable.github.io/
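
To make the idea of long-distance noise correlation concrete, here is a minimal sketch, not the authors' algorithm. The abstract does not spell out the reschedule itself, so the sketch swaps in a generic mixed-noise construction: every frame's initial noise combines one shared vector with a per-frame vector. The function names and the `alpha` mixing weight are assumptions made for illustration only.

```python
import math
import random


def correlated_frame_noise(num_frames, dim, alpha, seed=0):
    """Return one unit-variance noise vector per frame.

    Hypothetical construction: alpha in [0, 1] is the share of variance taken
    from a noise vector common to all frames, so any two frames, however far
    apart, have an expected correlation of alpha.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    rng = random.Random(seed)
    shared = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    w_shared = math.sqrt(alpha)
    w_own = math.sqrt(1.0 - alpha)
    frames = []
    for _ in range(num_frames):
        own = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Weights satisfy w_shared**2 + w_own**2 == 1, keeping unit variance.
        frames.append([w_shared * s + w_own * o for s, o in zip(shared, own)])
    return frames


def correlation(a, b):
    """Pearson correlation between two equal-length vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)
```

With `alpha` close to 1 the frames start from nearly the same noise, which would favor stable content; with `alpha` close to 0 they are independent. How the paper actually balances the two across its contour-shaping and motion-refining phases is not stated in this summary.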

Paper Structure (15 sections, 12 equations, 8 figures, 2 tables)


Figures (8)

  • Figure 1: Examples of our method for image animation. The first column displays the input reference image in conjunction with the arrow controls, serving as motion control. The second column depicts the refined motion field based on the directional information provided by the input arrows. The final column showcases selected frames from the generated animation sequence, specifically frames 4, 8, 12, 16, 20 and 24.
  • Figure 2: Overview of motion field guidance: (a) Training stage: We extract the optical-flow motion field and the motion strength from training videos as conditional constraints. The motion field is enhanced through a spatio-temporal layer attention mechanism, while the motion strength is projected into positional embeddings and concatenated with the timestep embeddings. (b) Inference stage: The control arrows provided by the user are first transformed into a sparse motion field, which is then converted to a dense motion field by interpolation. A refinement model then produces the refined motion field. The motion field, together with the input motion strength, regulates the video generation.
  • Figure 3: Variability in noise patterns and contour accuracy is evident across different timesteps. The upper part of the curve graph illustrates the visual outcomes at every 20-step interval.
  • Figure 4: Qualitative results between baselines and our approach. Additional examples are provided in the supplemental material.
  • Figure 5: Qualitative results of longer video generation between baselines and our approach. The frames presented are 25, 50, 75, and 100 from the generated video.
  • ...and 3 more figures
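
The inference step in the Figure 2 caption (user arrows, then a sparse motion field, then a dense field by interpolation) can be sketched as follows. The caption does not name the interpolation scheme, so inverse-distance weighting is used here purely as a stand-in; the function name, the arrow format, and the `power` parameter are all hypothetical, and the paper's refinement model is not reproduced.

```python
def dense_flow_from_arrows(arrows, height, width, power=2.0):
    """Interpolate sparse arrows into a dense per-pixel motion field.

    arrows: list of ((x0, y0), (x1, y1)) pixel pairs (assumed input format);
            each arrow contributes the displacement (x1 - x0, y1 - y0)
            anchored at its start point.
    Returns flow[y][x] = (dx, dy), using inverse-distance weighting.
    """
    if not arrows:
        return [[(0.0, 0.0) for _ in range(width)] for _ in range(height)]
    anchors = [(x0, y0, x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in arrows]
    flow = []
    for y in range(height):
        row = []
        for x in range(width):
            sum_dx = sum_dy = sum_w = 0.0
            exact = None
            for ax, ay, dx, dy in anchors:
                dist_sq = (x - ax) ** 2 + (y - ay) ** 2
                if dist_sq == 0:
                    # The pixel is an arrow start: keep that arrow's vector.
                    exact = (float(dx), float(dy))
                    break
                weight = 1.0 / dist_sq ** (power / 2.0)
                sum_dx += weight * dx
                sum_dy += weight * dy
                sum_w += weight
            if exact is not None:
                row.append(exact)
            else:
                row.append((sum_dx / sum_w, sum_dy / sum_w))
        flow.append(row)
    return flow
```

This naive interpolation spreads motion over the whole image, with no notion of which regions can move, so it should be read only as an illustration of the sparse-to-dense step.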