Follow-Your-Emoji-Faster: Towards Efficient, Fine-Controllable, and Expressive Freestyle Portrait Animation

Yue Ma, Zexuan Yan, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, Zhifeng Li, Wei Liu, Linfeng Zhang, Qifeng Chen

TL;DR

Follow-Your-Emoji-Faster introduces a diffusion-based framework for expressive freestyle portrait animation driven by expression-aware landmarks. It combines a facial fine-grained loss, a progressive long-term generation strategy, and a Taylor-interpolated caching method to deliver high-quality, stable animations with substantial speedups. The approach generalizes across diverse portrait styles and driving motions, and is validated on the EmojiBench++ benchmark with strong qualitative and quantitative results. A new dataset and benchmark are released to support future research in this domain.

Abstract

We present Follow-Your-Emoji-Faster, an efficient diffusion-based framework for freestyle portrait animation driven by facial landmarks. The main challenges in this task are preserving the identity of the reference portrait, accurately transferring target expressions, and maintaining long-term temporal consistency while keeping generation efficient. To address identity preservation and accurate expression retargeting, we enhance Stable Diffusion with two key components: expression-aware landmarks used as explicit motion signals, which improve motion alignment, support exaggerated expressions, and reduce identity leakage; and a fine-grained facial loss that leverages both expression and facial masks to better capture subtle expressions and faithfully preserve the reference appearance. With these components, our model supports controllable and expressive animation across diverse portrait types, including real faces, cartoons, sculptures, and animals. However, diffusion-based frameworks typically struggle to generate stable long-term animations efficiently, which remains a core challenge in this task. To address this, we propose a progressive generation strategy for stable long-term animation and introduce a Taylor-interpolated cache that achieves a 2.6× lossless acceleration. Together, these two strategies let our method produce high-quality results efficiently, making it user-friendly and accessible. Finally, we introduce EmojiBench++, a more comprehensive benchmark comprising diverse portraits, driving videos, and landmark sequences. Extensive evaluations on EmojiBench++ demonstrate that Follow-Your-Emoji-Faster achieves superior performance in both animation quality and controllability. The code, training dataset, and benchmark will be available at https://follow-your-emoji.github.io/.
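
The abstract names a Taylor-interpolated cache but does not give its formula here. As a rough illustration of the general idea of predicting a layer's features at skipped denoising steps instead of recomputing them, the sketch below keeps the last fully computed feature and a finite-difference slope, then extrapolates to first order. The class name `TaylorFeatureCache`, its methods, and the first-order truncation are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


class TaylorFeatureCache:
    """Hypothetical first-order feature cache for a single layer (illustrative only)."""

    def __init__(self):
        self.step = None     # step index of the most recent full computation
        self.feature = None  # feature computed at that step
        self.slope = None    # finite-difference estimate of d(feature)/d(step)

    def update(self, step, feature):
        """Store a fully computed feature and refresh the slope estimate."""
        feature = np.asarray(feature, dtype=np.float64)
        if self.feature is not None:
            if step == self.step:
                raise ValueError("update needs a new step index")
            self.slope = (feature - self.feature) / (step - self.step)
        self.step, self.feature = step, feature

    def predict(self, step):
        """Approximate the feature at `step` without running the layer."""
        if self.feature is None:
            raise ValueError("cache is empty")
        if self.slope is None:
            # Only one sample so far: fall back to plain feature reuse.
            return self.feature.copy()
        return self.feature + self.slope * (step - self.step)


# Toy check with a feature that is exactly linear in the step index.
cache = TaylorFeatureCache()
cache.update(0, [1.0, 1.0])
cache.update(2, [5.0, 5.0])
predicted = cache.predict(3)
```

With a single cached sample the sketch degenerates to plain feature reuse; higher-order terms would need more cached samples and are left out.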

Paper Structure

This paper contains 23 sections, 4 equations, 16 figures, 8 tables, 1 algorithm.

Figures (16)

  • Figure 1: Qualitative results of Follow-Your-Emoji-Faster. The input column shows the reference portrait and the corresponding motion landmarks. Driven by landmark sequences with exaggerated expressions, our portrait animation framework can animate freestyle reference portraits, e.g., cartoons, real faces, sculptures, and even animals. Quantitative results are also shown to highlight the efficiency gained from our acceleration.
  • Figure 2: Overview of Follow-Your-Emoji-Faster. We first extract features from the expression-aware landmark sequence with a landmark encoder and fuse them with multi-frame noise. We then use the progressive strategy to randomly mask frames of the input latent sequence. Finally, we concatenate this latent sequence with the fused multi-frame noise and feed it to the denoising UNet for video generation. The appearance net and image prompt injection module help the model preserve the identity of the reference portrait, and temporal attention maintains temporal consistency. During training, the facial fine-grained loss guides the UNet to pay more attention to facial and expression generation. During inference, we align the target landmarks with the reference portrait using the motion alignment module. We then generate the keyframes and use the progressive strategy to predict long videos with the Taylor-Interpolated Cache, which accelerates inference by reusing and predicting layer-wise features.
  • Figure 3: Illustration of the progressive strategy. As in the training stage, we first generate the keyframes of the video. We then concatenate the first and last frames with the noise and feed them to the model to generate the intermediate content.
  • Figure 4: Details of our facial fine-grained loss. We first extract the facial mask and the expression mask from our landmarks. We then compute the denoising loss $\mathcal{L}_{FFG}$ within these masked regions.
  • Figure 5: Examples from EmojiBench++. We collected 500 portraits with high expression diversity, exaggeration, and various visual styles.
  • ...and 11 more figures
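
The Figure 4 caption describes computing the denoising loss $\mathcal{L}_{FFG}$ only inside the facial and expression masks. A minimal sketch of that idea is given below, assuming a mean-squared error between predicted and target noise and an equal-weight sum of the two masked terms. The function names, the array shapes, and the weighting are assumptions for illustration; the paper's actual equation is not part of this excerpt.

```python
import numpy as np


def masked_mse(pred, target, mask, eps=1e-8):
    """Mean squared error restricted to the region where mask == 1."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # Broadcast an (H, W) mask across the channel axis of a (C, H, W) tensor.
    mask = np.broadcast_to(mask, pred.shape).astype(np.float64)
    squared_error = (pred - target) ** 2
    return float((squared_error * mask).sum() / (mask.sum() + eps))


def facial_fine_grained_loss(noise_pred, noise_target, face_mask, expr_mask):
    """Assumed form: masked denoising loss over the face plus the expression region."""
    face_term = masked_mse(noise_pred, noise_target, face_mask)
    expr_term = masked_mse(noise_pred, noise_target, expr_mask)
    return face_term + expr_term  # equal weighting is an assumption


# Toy example: 2-channel 4x4 latent, face in the top half, expression in one pixel.
noise_pred = np.zeros((2, 4, 4))
noise_target = np.ones((2, 4, 4))
face_mask = np.zeros((4, 4))
face_mask[:2, :] = 1.0
expr_mask = np.zeros((4, 4))
expr_mask[1, 1] = 1.0
loss = facial_fine_grained_loss(noise_pred, noise_target, face_mask, expr_mask)
```

Errors outside both masks contribute nothing to this loss, which is the property the caption emphasizes.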