JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation

Xuyang Cao, Guoxin Wang, Sheng Shi, Jun Zhao, Yang Yao, Jintao Fei, Minyu Gao

TL;DR

JoyVASA is a two-stage, diffusion-based framework that decouples the static 3D facial representation from dynamic motion, which enables longer, more coherent audio-driven animations and extends rendering to animal faces. A diffusion transformer generates identity-independent motion conditioned on audio, and a renderer combines the static representation with the generated motion to produce the output frames. The approach builds on a disentangled representation using LivePortrait-compatible components and trains on a hybrid dataset of private Chinese and public English data, which gives it multilingual support. By reducing reliance on motion references and allowing cross-identity rendering, including non-human faces, the work has practical implications for multilingual digital avatars and animal animation.

Abstract

Audio-driven portrait animation has made significant advances with diffusion-based models, improving video quality and lipsync accuracy. However, the increasing complexity of these models has led to inefficiencies in training and inference, as well as constraints on video length and inter-frame continuity. In this paper, we propose JoyVASA, a diffusion-based method for generating facial dynamics and head motion in audio-driven facial animation. Specifically, in the first stage, we introduce a decoupled facial representation framework that separates dynamic facial expressions from static 3D facial representations. This decoupling allows the system to generate longer videos by combining any static 3D facial representation with dynamic motion sequences. Then, in the second stage, a diffusion transformer is trained to generate motion sequences directly from audio cues, independent of character identity. Finally, a generator trained in the first stage uses the 3D facial representation and the generated motion sequences as inputs to render high-quality animations. With the decoupled facial representation and the identity-independent motion generation process, JoyVASA extends beyond human portraits to animate animal faces seamlessly. The model is trained on a hybrid dataset of private Chinese and public English data, enabling multilingual support. Experimental results validate the effectiveness of our approach. Future work will focus on improving real-time performance and refining expression control, further expanding the applications in portrait animation. The code is available at: https://github.com/jdh-algo/JoyVASA.
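
The decoupling described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `animate` function, the `Appearance` container, and the three `toy_*` stand-ins are hypothetical names introduced here, and the toy functions replace the real appearance encoder, diffusion transformer, and generator. The sketch only shows the control flow: appearance is encoded once, motion is generated from audio without access to the identity, and the renderer combines the two frame by frame.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Motion = List[float]        # one frame of motion (toy format, assumed)
AudioWindow = List[float]   # one chunk of audio features (toy format, assumed)


@dataclass(frozen=True)
class Appearance:
    """Static, identity-specific features (hypothetical container)."""
    identity: str


def animate(
    reference_image: str,
    audio_windows: Sequence[AudioWindow],
    encode_appearance: Callable[[str], Appearance],
    generate_motion: Callable[[AudioWindow], List[Motion]],
    render_frame: Callable[[Appearance, Motion], str],
) -> List[str]:
    """Render one frame per generated motion vector.

    The appearance is encoded once. Motion is generated per audio window
    and never sees the identity, so more audio simply yields more frames.
    """
    appearance = encode_appearance(reference_image)
    frames: List[str] = []
    for window in audio_windows:
        for motion in generate_motion(window):
            frames.append(render_frame(appearance, motion))
    return frames


# Toy stand-ins so the sketch runs; none of these are real models.
def toy_encode_appearance(image: str) -> Appearance:
    return Appearance(identity=image)


def toy_generate_motion(window: AudioWindow) -> List[Motion]:
    # One motion vector per audio feature; a real system would sample
    # a motion sequence from the diffusion transformer here.
    return [[round(value, 3)] for value in window]


def toy_render_frame(appearance: Appearance, motion: Motion) -> str:
    return f"{appearance.identity}:{motion[0]}"


audio = [[0.1, 0.2], [0.3, 0.4, 0.5]]
human = animate("human.png", audio, toy_encode_appearance,
                toy_generate_motion, toy_render_frame)
cat = animate("cat.png", audio, toy_encode_appearance,
              toy_generate_motion, toy_render_frame)
print(len(human), human[0], cat[0])
```

Because `toy_generate_motion` takes only audio, the human and the cat receive the same motion sequence; only the rendering step differs. This is the property that lets a single motion model drive arbitrary identities.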

Paper Structure

This paper contains 17 sections, 8 equations, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: Inference Pipeline of the proposed JoyVASA.
  • Figure 2: Training process of the audio-driven motion sequence generation. The audio features and real motion sequences are first extracted with the frozen wav2vec2 encoder (Baevski et al., 2020) and the frozen motion encoder in LivePortrait (Guo et al., 2024). A diffusion transformer is then trained to sample the clean motion sequence from noise.
  • Figure 3: Visualization results of different methods on the CelebV-HQ test dataset.
  • Figure 4: Visualization results of different portraits driven by the same audio input on the Openset dataset. Note that the proposed method can drive portraits of humans, animated characters, artworks, and animals with the same audio.
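
The training setup in the Figure 2 caption can be sketched as a single loss computation. This is a hedged illustration rather than the paper's training code: it assumes a standard DDPM forward process with a linear beta schedule and a mean-squared error on the predicted clean motion sequence, and the paper's actual schedule and loss terms may differ. All function names (`make_alpha_bars`, `add_noise`, `training_loss`) and the two toy denoisers are hypothetical; plain Python lists stand in for the wav2vec2 audio features and the LivePortrait motion sequences.

```python
import math
import random
from typing import Callable, List

Sequence2D = List[List[float]]  # frames x feature dimensions


def make_alpha_bars(num_steps: int, beta_start: float = 1e-4,
                    beta_end: float = 0.02) -> List[float]:
    """Cumulative products of (1 - beta_t) for a linear schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for step in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * step / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(clean: Sequence2D, alpha_bar: float,
              rng: random.Random) -> Sequence2D:
    """Forward process: x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps."""
    signal, noise = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [[signal * v + noise * rng.gauss(0.0, 1.0) for v in frame]
            for frame in clean]


def training_loss(
    clean_motion: Sequence2D,
    audio_features: Sequence2D,
    denoiser: Callable[[Sequence2D, int, Sequence2D], Sequence2D],
    alpha_bars: List[float],
    rng: random.Random,
) -> float:
    """MSE between the denoiser's prediction and the true clean motion."""
    step = rng.randrange(len(alpha_bars))            # random diffusion step
    noisy = add_noise(clean_motion, alpha_bars[step], rng)
    predicted = denoiser(noisy, step, audio_features)
    errors = [(p - c) ** 2
              for pred_frame, clean_frame in zip(predicted, clean_motion)
              for p, c in zip(pred_frame, clean_frame)]
    return sum(errors) / len(errors)


rng = random.Random(0)
alpha_bars = make_alpha_bars(1000)
clean = [[0.5, -0.2], [0.4, -0.1], [0.3, 0.0]]   # toy motion sequence
audio = [[0.0], [0.0], [0.0]]                    # toy audio features


def identity_denoiser(noisy, step, audio_features):
    return noisy   # untrained placeholder for the diffusion transformer


def oracle_denoiser(noisy, step, audio_features):
    return clean   # a perfect predictor, for comparison only


identity_loss = training_loss(clean, audio, identity_denoiser, alpha_bars, rng)
oracle_loss = training_loss(clean, audio, oracle_denoiser, alpha_bars, rng)
print(identity_loss, oracle_loss)
```

The two encoders stay frozen in the caption's setup, so in a real training loop only the denoiser's parameters would receive gradients; the oracle denoiser above simply shows that the loss reaches zero when the clean sequence is recovered exactly.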