Learning to Animate Images from A Few Videos to Portray Delicate Human Actions

Haoxin Li, Yingchen Yu, Qilong Wu, Hanwang Zhang, Song Bai, Boyang Li

TL;DR

This work tackles the problem of animating static images into videos depicting delicate human actions using very few examples. It introduces FLASH, a few-shot framework with two components: a Motion Alignment Module, which learns appearance-independent motion by aligning motion patterns across videos that share a motion but differ in appearance, and a Detail Enhancement Decoder, which propagates details from the reference frame to produce smooth transitions. Across 16 actions, FLASH outperforms baselines on multiple automatic metrics and is strongly preferred by human evaluators, while generalizing to diverse and non-realistic reference images. The approach enables practical, controllable action animation with limited data, offering a path toward scalable, high-fidelity image-to-video generation for production workflows.

Abstract

Despite recent progress, video generative models still struggle to animate static images into videos that portray delicate human actions, particularly when handling uncommon or novel actions for which training data are limited. In this paper, we explore the task of learning to animate images to portray delicate human actions using a small number of videos (16 or fewer), which is highly valuable for real-world applications like video and movie production. Learning generalizable motion patterns that smoothly transition from user-provided reference images in a few-shot setting is highly challenging. We propose FLASH (Few-shot Learning to Animate and Steer Humans), which learns generalizable motion patterns by forcing the model to reconstruct a video using the motion features and cross-frame correspondences of another video with the same motion but different appearance. This encourages transferable motion learning and mitigates overfitting to limited training data. Additionally, FLASH adds layers to the decoder that propagate details from the reference image to the generated frames, improving transition smoothness. Human judges overwhelmingly favor FLASH, with 65.78% of 488 responses preferring FLASH over the baselines. We strongly recommend watching the videos on the website: https://lihaoxin05.github.io/human_action_animation/, as motion artifacts are hard to notice in still images.

Paper Structure

This paper contains 27 sections, 6 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Comparison of animated human action videos produced by KLING AI, Wanx AI and FLASH (our method). In the balance beam jump action, Wanx AI produces physics-defying movements, whereas KLING AI generates a jump but fails to portray the standard jump on the balance beam. For the soccer shooting action, both KLING AI and Wanx AI struggle to generate the correct shooting motion and the person never kicks the ball away. In contrast, FLASH successfully generates actions that resemble the real-world actions in the last row. We strongly recommend watching the animated videos in the Webpage, as motion artifacts can be hard to notice from static images.
  • Figure 2: An illustration of the Motion Alignment Module. Both the noised latent representations of the original and strongly augmented videos are input to the U-Net. In the temporal attention layers, static and motion features are extracted from both videos. Motion features from the original video are transferred to the augmented video (red arrows), and the recombined features are passed to the next layer. In the cross-frame attention layers, attention scores from the original video, which capture its cross-frame motion structure, are used to warp the augmented video (red arrow) before passing it to the next layer. The U-Net is trained to predict the noise added to both videos based on the motion patterns of the original video, encouraging the learning of consistent motion patterns.
  • Figure 3: The percentage of users who chose each video generator's output as the best video in the user study on Amazon Mechanical Turk. The proposed method, FLASH, received the vast majority of votes.
  • Figure 4: Qualitative comparison of different methods. We strongly recommend watching the animated videos in the Webpage, as motion artifacts are hard to notice from static images.
  • Figure 5: Animated videos from FLASH using reference images taken from the Internet or generated by Stable Diffusion 3.
  • ...and 6 more figures
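
The two feature-sharing steps described in the Figure 2 caption can be illustrated with a toy sketch. This is not the authors' implementation: the caption does not say how static and motion features are separated, so the code below assumes, purely for illustration, that the static part is the per-channel mean over frames and the motion part is the per-frame residual. All function names are hypothetical, and features are plain lists of per-frame vectors rather than U-Net activations.

```python
import math

def split_static_motion(frames):
    """Split per-frame features into a static part and a motion part.
    ASSUMPTION (not from the paper): static = mean over frames,
    motion = each frame minus that mean."""
    n = len(frames)
    static = [sum(channel) / n for channel in zip(*frames)]
    motion = [[x - s for x, s in zip(frame, static)] for frame in frames]
    return static, motion

def transfer_motion(orig_frames, aug_frames):
    """Keep the augmented video's static features and replace its
    motion features with those of the original video."""
    _, motion_orig = split_static_motion(orig_frames)
    static_aug, _ = split_static_motion(aug_frames)
    return [[s + m for s, m in zip(static_aug, frame)] for frame in motion_orig]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_scores(queries, keys):
    """Scaled dot-product attention weights, one row per query."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(q * k for q, k in zip(qv, kv)) / scale for kv in keys])
        for qv in queries
    ]

def warp_with_scores(scores, values):
    """Mix value vectors using attention weights computed elsewhere,
    e.g. weights from the original video applied to augmented values."""
    width = len(values[0])
    return [
        [sum(w * v[c] for w, v in zip(row, values)) for c in range(width)]
        for row in scores
    ]

# Toy data: 3 frames, 2 channels per frame (hypothetical values).
orig = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]
aug = [[10.0, 5.0], [11.0, 5.0], [12.0, 5.0]]

# Temporal-attention step: augmented appearance, original motion.
recombined = transfer_motion(orig, aug)

# Cross-frame-attention step: weights come from the original video only,
# then they are applied to the augmented video's values.
weights = attention_scores(orig, orig)
warped_aug = warp_with_scores(weights, aug)
```

In this sketch the augmented branch never computes its own attention weights or motion residuals, which mirrors the caption's point that both videos are reconstructed from the motion patterns of the original video. In the actual model these operations would act on learned features inside the U-Net's temporal and cross-frame attention layers.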