Every Painting Awakened: A Training-free Framework for Painting-to-Animation Generation

Lingyu Liu, Yaxiong Wang, Li Zhu, Zhedong Zheng

TL;DR

This work tackles the problem of animating real-world paintings with image-to-video (I2V) diffusion models, which are typically trained on natural videos. It introduces a training-free framework that uses synthetic proxy images generated by pre-trained image diffusion models to guide text-driven motion while preserving painting fidelity. The core innovations are dual-path score distillation, which separately refines motion priors from real paintings and synthetic proxies, and hybrid latent fusion via spherical linear interpolation to produce temporally coherent animations. The approach is plug-and-play, requiring no additional learnable parameters, and demonstrates consistent improvements across multiple I2V baselines in both fidelity and semantic alignment with text prompts. It also extends to natural images and offers insights into synthesis strategies and limitations, highlighting practical impact for digital art animation and related applications.

Abstract

We introduce a training-free framework specifically designed to bring real-world static paintings to life through image-to-video (I2V) synthesis, addressing the persistent challenge of aligning the generated motion with textual guidance while preserving fidelity to the original artworks. Existing I2V methods, primarily trained on natural video datasets, often struggle to generate dynamic outputs from static paintings. It remains challenging to generate motion while maintaining visual consistency with real-world paintings. This results in two distinct failure modes: either static outputs due to limited text-based motion interpretation or distorted dynamics caused by inadequate alignment with real-world artistic styles. We leverage the advanced text-image alignment capabilities of pre-trained image models to guide the animation process. Our approach introduces synthetic proxy images through two key innovations: (1) Dual-path score distillation: We employ a dual-path architecture to distill motion priors from both real and synthetic data, preserving static details from the original painting while learning dynamic characteristics from synthetic frames. (2) Hybrid latent fusion: We integrate hybrid features extracted from real paintings and synthetic proxy images via spherical linear interpolation in the latent space, ensuring smooth transitions and enhancing temporal consistency. Experimental evaluations confirm that our approach significantly improves semantic alignment with text prompts while faithfully preserving the unique characteristics and integrity of the original paintings. Crucially, by achieving enhanced dynamic effects without requiring any model training or learnable parameters, our framework enables plug-and-play integration with existing I2V methods, making it an ideal solution for animating real-world paintings. More animated examples can be found on our project website.
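
To make the hybrid latent fusion step more concrete, the following is a minimal sketch of spherical linear interpolation (slerp) between two latent vectors, applied once per frame. It is an illustration under stated assumptions rather than the authors' implementation: the function names `slerp` and `fuse_across_frames`, the use of flat Python lists as latents, and the linear 0-to-1 weight schedule over frames are all hypothetical choices that the abstract does not specify.

```python
import math
from typing import List, Sequence


def slerp(a: Sequence[float], b: Sequence[float], t: float,
          eps: float = 1e-7) -> List[float]:
    """Spherically interpolate between vectors a and b with weight t in [0, 1]."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # The angle is undefined for a zero vector: use linear interpolation.
        return [(1.0 - t) * x + t * y for x, y in zip(a, b)]

    # Angle between the two vectors, clamped for numerical safety.
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)

    if sin_omega < eps:
        # Nearly parallel or anti-parallel vectors: slerp weights are
        # ill-conditioned, so fall back to linear interpolation.
        return [(1.0 - t) * x + t * y for x, y in zip(a, b)]

    w_a = math.sin((1.0 - t) * omega) / sin_omega
    w_b = math.sin(t * omega) / sin_omega
    return [w_a * x + w_b * y for x, y in zip(a, b)]


def fuse_across_frames(real_latent: Sequence[float],
                       proxy_latent: Sequence[float],
                       num_frames: int) -> List[List[float]]:
    """Return one fused latent per frame.

    Assumption (not from the paper): the weight moves linearly from 0
    (pure real-painting latent) to 1 (pure proxy latent) over the frames.
    """
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    return [
        slerp(real_latent, proxy_latent, f / (num_frames - 1))
        for f in range(num_frames)
    ]
```

For two orthogonal unit vectors, the midpoint produced by `slerp` keeps unit norm, whereas a plain linear blend would shrink it to about 0.71. Preserving the latent magnitude in this way is a common reason to prefer slerp over linear interpolation when mixing diffusion latents.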

Paper Structure

This paper contains 18 sections, 4 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: The animation of AnimateAnything (top) vs. our method (bottom). Given the prompt "Clouds drift across the sky" and a consistent global seed, our method, which utilizes a refined synthetic proxy image for future guidance, shows a significantly improved response to the prompt compared to the base I2V model.
  • Figure 2: Visualization of different video latent vectors. We employ t-SNE to visualize the latent vectors of different videos in the feature space, where each point represents a video frame. Frames from the same video form a single cluster. Clusters formed by real videos from the MSR-VTT dataset exhibit a linear trend, indicating that frames of real videos are orderly arranged in their feature space. In contrast, clusters formed by videos synthesized using AnimateAnything are tightly packed and do not show a clear single linear trend. Clusters generated by videos recreated using our method share a similar linear trend with those of the real videos. This suggests that our approach achieves high fidelity and continuity in capturing video features.
  • Figure 3: Illustration of our method. Given a real painting and its synthetic counterpart refined by an image diffusion model, we apply dual-path video score distillation sampling to infuse their latent vectors with motion information. Next, we perform spherical linear interpolation on these updated latent vectors across the frame dimension. The hybrid latent vectors are then fed into the I2V model to generate dynamic videos.
  • Figure 4: Qualitative comparisons with current image-to-video methods. AnimateAnything fails to interpret the text prompt's request for the "boat" to move, while our method can generate a video showing the boat moving forward. Compared to ConsistI2V, our approach produces more pronounced and superior motion effects for "smoke". Cinemo struggles with both retaining the input image's information and understanding the prompt's intent. By incorporating our inference method, it not only preserves the input image's style but also successfully animates the "leaves" to flutter.
  • Figure 5: Ablation study of the two adjustable components. Our full model (a) accurately interprets the semantic content of the text prompt, effectively generating a visual sensation of sunlight penetrating the scene and ensuring that the light source originates from the side with the tree branches. The variant employing uniform linear interpolation (b) fails to adequately capture variations in lighting conditions. The variant that solely performs video score distillation sampling (c) struggles to accurately interpret the semantics of the text prompt, resulting in static video outputs. The variant utilizing only spherical linear interpolation (d) shows erroneous modifications during the latter stages of generation.
  • ...and 6 more figures