SkyReels-A1: Expressive Portrait Animation in Video Diffusion Transformers

Di Qiu, Zhengcong Fei, Rui Wang, Jialin Bai, Changqian Yu, Mingyuan Fan, Guibin Chen, Xiang Wen

TL;DR

SkyReels-A1 addresses portrait animation by transferring facial expressions and motion from a driving video to a reference portrait while preserving identity. It builds on a diffusion-transformer backbone with an expression-aware conditioning module and a cross-modal facial image-text alignment module that tightly couples appearance and motion. A three-stage training regime progressively improves motion transfer, identity stability, and temporal coherence across diverse body proportions. Empirical results show better image quality and motion fidelity than the baselines, indicating robustness and applicability to virtual avatars and remote communication. Code and demos are publicly released.

Abstract

We present SkyReels-A1, a simple yet effective framework built upon a video diffusion Transformer (DiT) to facilitate portrait image animation. Existing methods still suffer from identity distortion, background instability, and unrealistic facial dynamics, particularly in head-only animation scenarios. Moreover, extending them to accommodate diverse body proportions usually leads to visual inconsistencies or unnatural articulations. To address these challenges, SkyReels-A1 capitalizes on the strong generative capabilities of the video DiT, enhancing facial motion transfer precision, identity retention, and temporal coherence. The system incorporates an expression-aware conditioning module that enables seamless video synthesis driven by expression-guided landmark inputs. A facial image-text alignment module strengthens the fusion of facial attributes with motion trajectories, reinforcing identity preservation. Additionally, SkyReels-A1 adopts a multi-stage training paradigm to incrementally refine the correlation between expressions and motion while ensuring stable identity reproduction. Extensive empirical evaluations highlight the model's ability to produce visually coherent and compositionally diverse results, making it highly applicable to domains such as virtual avatars, remote communication, and digital media generation.

Paper Structure

This paper contains 26 sections, 8 equations, 5 figures, and 1 table.

Figures (5)

  • Figure 1: SkyReels-A1 can generate an animated portrait from a reference image, driven by video motions while ensuring identity preservation. Notably, our method ensures accurate transfer of facial expressions and body movements, allowing for realistic, high-quality portrait animations at any body proportion that integrate naturally into different scenes.
  • Figure 2: Overview of the SkyReels-A1 framework. Given an input video sequence and a reference portrait image, we extract facial expression-aware landmarks from the video, which serve as motion descriptors for transferring expressions onto the portrait. Utilizing a conditional video generation framework based on DiT, our approach directly integrates these facial expression-aware landmarks into the input latent space. In line with prior research, we employ a pose guidance mechanism constructed within a VAE architecture. This component encodes facial expression-aware landmarks as conditional input for the DiT framework, thereby enabling the model to capture essential low-dimensional visual attributes while preserving the semantic integrity of facial features.
  • Figure 3: Qualitative portrait animation results from SkyReels-A1. Given a static portrait image as input, our model can vividly animate it, ensuring seamless stitching and offering precise control over eye and lip movements.
  • Figure 4: Qualitative comparison results. Our SkyReels-A1 model better transfers lip movements and eye gaze from another person, while maintaining the identity of the source portrait.
  • Figure 5: More generated results from SkyReels-A1 across diverse body proportions.
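
The conditioning path described in the Figure 2 caption (landmark frames encoded to latents, then fed to the DiT alongside the video latent) can be illustrated with a shape-level sketch. This is not the authors' implementation: `toy_encode` is a hypothetical stand-in for the VAE encoder (average pooling plus a fixed linear projection), the strides, latent width, and patch size are assumed values, and channel-wise concatenation is one plausible reading of "integrates into the input latent space", chosen here only for illustration.

```python
import numpy as np

LATENT_CHANNELS = 16  # assumed latent width, not taken from the paper


def toy_encode(video, proj, t_stride=4, s_stride=8):
    """Hypothetical stand-in for a VAE encoder (NOT a real VAE).

    Average-pools a (T, H, W, C) video over time and space, then projects
    the channel axis to the latent width with a fixed matrix `proj`.
    """
    t, h, w, c = video.shape
    pooled = video.reshape(
        t // t_stride, t_stride,
        h // s_stride, s_stride,
        w // s_stride, s_stride,
        c,
    ).mean(axis=(1, 3, 5))
    return pooled @ proj  # (T', H', W', LATENT_CHANNELS)


def build_dit_input(noisy_latent, landmark_latent):
    """Assumed conditioning scheme: concatenate along the channel axis."""
    return np.concatenate([noisy_latent, landmark_latent], axis=-1)


def patchify(latent, patch=2):
    """Split each latent frame into patch x patch tokens for a transformer."""
    t, h, w, c = latent.shape
    x = latent.reshape(t, h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)
    return x.reshape(t * (h // patch) * (w // patch), patch * patch * c)


rng = np.random.default_rng(0)
proj = rng.standard_normal((3, LATENT_CHANNELS))

# Rendered landmark frames from the driving video (random placeholder data).
landmark_video = rng.random((8, 64, 64, 3))
landmark_latent = toy_encode(landmark_video, proj)

# Noisy video latent at some diffusion step (random placeholder data).
noisy_latent = rng.standard_normal(landmark_latent.shape)

dit_input = build_dit_input(noisy_latent, landmark_latent)
tokens = patchify(dit_input)
print(dit_input.shape, tokens.shape)
```

With these assumed sizes, an 8-frame 64x64 landmark clip becomes a 2x8x8 latent grid, and the conditioned input doubles the channel count before tokenization. A real system would replace `toy_encode` with a trained video VAE and might inject the condition differently.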