AnimateAnything: Consistent and Controllable Animation for Video Generation

Guojun Lei, Chi Wang, Hong Li, Rong Zhang, Yikai Wang, Weiwei Xu

TL;DR

This work tackles controllable video generation under multiple, diverse control signals by unifying all of them into frame-by-frame optical flow. It introduces a two-stage diffusion-based pipeline: Stage 1 converts camera motion, drag annotations, and references into a single dense optical flow; Stage 2 uses this flow as conditioning for high-quality video synthesis, reinforced by a frequency-domain stabilization module. The approach outperforms state-of-the-art methods on image-to-video (I2V) tasks, and ablations support the need for both the unified flow and the spectral stabilization. The method offers precise and consistent video generation, with applications in film and virtual reality.

Abstract

We present AnimateAnything, a unified controllable video generation approach that facilitates precise and consistent video manipulation across various conditions, including camera trajectories, text prompts, and user motion annotations. Specifically, we carefully design a multi-scale control feature fusion network to construct a common motion representation for different conditions. It explicitly converts all control information into frame-by-frame optical flows. We then incorporate the optical flows as motion priors to guide the final video generation. In addition, to reduce the flickering caused by large-scale motion, we propose a frequency-based stabilization module that enhances temporal coherence by enforcing consistency in the video's frequency domain. Experiments demonstrate that our method outperforms state-of-the-art approaches. For more details and videos, please refer to the webpage: https://yu-shaonian.github.io/Animate_Anything/.
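
To make the idea of frequency-domain consistency concrete, the following is a minimal sketch and not the paper's stabilization module: it assumes that flicker appears as high temporal frequencies and simply zeroes them with an FFT along the time axis. The function name `temporal_lowpass` and the `keep_ratio` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np


def temporal_lowpass(video: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Suppress high temporal frequencies in a video of shape (T, H, W, C).

    Illustrative assumption: flicker lives in the highest temporal
    frequency bins, so zeroing them yields a smoother sequence.
    """
    num_frames = video.shape[0]
    # Real FFT along the time axis: one spectrum per pixel and channel.
    spectrum = np.fft.rfft(video, axis=0)
    # Keep the DC term plus the lowest `keep_ratio` fraction of bins.
    num_bins = spectrum.shape[0]
    cutoff = max(1, int(np.ceil(keep_ratio * num_bins)))
    spectrum[cutoff:] = 0
    # Back to the time domain with the original number of frames.
    return np.fft.irfft(spectrum, n=num_frames, axis=0)


# Usage: a constant gray clip with frame-alternating brightness flicker.
frames = np.full((8, 4, 4, 3), 0.5)
flicker = 0.1 * (-1.0) ** np.arange(8)
noisy = frames + flicker[:, None, None, None]
smoothed = temporal_lowpass(noisy, keep_ratio=0.5)
```

A hard cutoff like this also blurs genuinely fast motion, which is presumably why a learned module is used in practice; the sketch only illustrates why working in the frequency domain is a natural way to address flicker.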

Paper Structure

This paper contains 13 sections, 6 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Animate anything. Consistent and controllable animation for different kinds of control signals. Given a reference image and corresponding user prompts, our approach can animate arbitrary characters, generating clear, stable videos while preserving the appearance details of the reference object.
  • Figure 2: The generated optical flow by our method with different condition signals. Given a specific image, from top to bottom are optical flows generated with camera trajectory, arrow-based motion annotation, and both conditions, respectively.
  • Figure 3: AnimateAnything pipeline. The pipeline consists of two stages: 1) Unified Flow Generation, which creates a unified optical flow representation from visual control signals using two synchronized latent diffusion models, the Flow Generation Model (FGM) and the Camera Reference Model (CRM). The FGM accepts sparse or coarse optical flow derived from visual signals other than the camera trajectory. The CRM takes the encoded reference image and the camera trajectory embedding as input and generates multi-level reference features. These features are fed into a reference attention layer that progressively guides the FGM's denoising process at each time step, producing a unified dense optical flow. 2) Video Generation, which compresses the generated unified flow with a 3D VAE encoder and integrates it with video latents from the image encoder using a single ViT block. The result is then combined with text embeddings, and the DiT blocks generate the final video.
  • Figure 4: Video stabilization module
  • Figure 5: Camera trajectory comparison with other trajectory-based methods
  • ...and 4 more figures
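
The Figure 3 caption states that the FGM accepts sparse or coarse optical flow derived from visual signals such as arrow-based motion annotations. As a rough sketch of what such an input could look like, the code below rasterizes drag arrows into a sparse per-frame flow field. It is an assumption-laden illustration, not the paper's preprocessing: the function name `arrows_to_sparse_flow`, the constant-velocity assumption, and the single-pixel footprint are all hypothetical.

```python
import numpy as np


def arrows_to_sparse_flow(arrows, height, width, num_frames):
    """Rasterize drag arrows into a sparse flow field.

    arrows: list of ((x0, y0), (x1, y1)) pixel coordinates.
    Returns flow of shape (T, H, W, 2) and a boolean mask of shape (T, H, W)
    marking the pixels that carry a motion hint.
    """
    flow = np.zeros((num_frames, height, width, 2), dtype=np.float32)
    mask = np.zeros((num_frames, height, width), dtype=bool)
    for (x0, y0), (x1, y1) in arrows:
        # Assumed constant velocity: split the drag evenly across frames.
        dx = (x1 - x0) / num_frames
        dy = (y1 - y0) / num_frames
        for t in range(num_frames):
            # Position of the dragged point at the start of frame t.
            x = int(round(x0 + dx * t))
            y = int(round(y0 + dy * t))
            if 0 <= x < width and 0 <= y < height:
                flow[t, y, x] = (dx, dy)
                mask[t, y, x] = True
    return flow, mask


# Usage: one horizontal drag of 4 pixels over 4 frames.
flow, mask = arrows_to_sparse_flow([((0, 0), (4, 0))], height=4, width=8, num_frames=4)
```

In the pipeline described above, a dense flow model would then fill in motion for all remaining pixels; here the mask simply records where the user supplied a hint.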