LiftAvatar: Kinematic-Space Completion for Expression-Controlled 3D Gaussian Avatar Animation

Hualiang Wei, Shunran Jia, Jialun Liu, Wenhui Li

TL;DR

LiftAvatar completes sparse monocular observations in kinematic space and uses the completed signals to drive high-fidelity avatar animation. It also introduces a multi-granularity expression control scheme that combines shading maps with expression coefficients for precise and stable driving.

Abstract

We present LiftAvatar, a new paradigm that completes sparse monocular observations in kinematic space (e.g., facial expressions and head pose) and uses the completed signals to drive high-fidelity avatar animation. LiftAvatar is a fine-grained, expression-controllable large-scale video diffusion Transformer that synthesizes high-quality, temporally coherent expression sequences conditioned on single or multiple reference images. The key idea is to lift incomplete input data into a richer kinematic representation, thereby strengthening both reconstruction and animation in downstream 3D avatar pipelines. To this end, we introduce (i) a multi-granularity expression control scheme that combines shading maps with expression coefficients for precise and stable driving, and (ii) a multi-reference conditioning mechanism that aggregates complementary cues from multiple frames, enabling strong 3D consistency and controllability. As a plug-and-play enhancer, LiftAvatar directly addresses the limited expressiveness and reconstruction artifacts of 3D Gaussian Splatting-based avatars caused by sparse kinematic cues in everyday monocular videos. By expanding incomplete observations into diverse pose-expression variations, LiftAvatar also enables effective prior distillation from large-scale video generative models into 3D pipelines, leading to substantial gains. Extensive experiments show that LiftAvatar consistently boosts animation quality and quantitative metrics of state-of-the-art 3D avatar methods, especially under extreme, unseen expressions.

Paper Structure (13 sections, 3 equations, 11 figures, 4 tables)

Figures (11)

  • Figure 1: We propose LiftAvatar, a kinematic-lifting framework that completes the facial expressions and head poses missing from a monotonous input video. LiftAvatar benefits subsequent reconstruction and driving tasks, resulting in significant improvements.
  • Figure 2: Method Pipeline. LiftAvatar is a high-precision, expression-controlled video diffusion transformer that enriches sparse observations to boost downstream avatar performance. It conditions on three groups of inputs: reference information $(I^R, S^R, E^R)$ from the input video; driving information $(S^D, E^D)$ for the target motion; and the ground-truth driving video $V^D$ during training (replaced by noise at inference). The reference images are encoded by a pre-trained VAE, and the Reference NPHM Encoder encodes their shading maps $S^R$. These features are concatenated, projected via Reference Patch Embedding, and summed with the embedded expression coefficients $E^R$ to form the reference token $x^R$. Likewise, the Driven NPHM Encoder processes the driving shading maps $S^D$; its output is projected by Driven Patch Embedding and combined with the embedded $E^D$ to produce the driving token $x^D$. The tokens $x^R$ and $x^D$ are concatenated and fed as a unified condition into the Wan2.1 [DBLP:journals/corr/abs-2503-20314] video diffusion transformer backbone. Optimized with a flow-matching objective, the model synthesizes high-fidelity, temporally coherent videos that accurately follow the driving signals, thereby completing the kinematic space of the original input.
  • Figure 3: Qualitative results for kinematic lifting. We compare our results with non-diffusion-based methods (FOMM [DBLP:conf/nips/SiarohinLT0S19] and Face Vid2vid [DBLP:conf/cvpr/WangM021]) as well as diffusion-based models (DiffusionAvatars [DBLP:conf/cvpr/KirschsteinGN24], LivePortrait [DBLP:journals/corr/abs-2407-03168], and HunyuanPortrait [DBLP:conf/cvpr/XuYZZJHJZCTL0L25]). Our method performs better when generating extreme expressions, particularly in facial texture detail, teeth accuracy, and pose accuracy.
  • Figure 4: Qualitative results for head avatar animation. We compare two head avatar animation methods before and after kinematic lifting. The comparison shows that the proposed strategy effectively improves subsequent reconstruction and driving.
  • Figure 5: Lifted results with LiftAvatar.
  • ...and 6 more figures
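
The conditioning path described in the Figure 2 caption can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the toy dimensions, the random stand-in weights, the use of one expression vector per token group, and the choice to concatenate $x^R$ and $x^D$ along the sequence axis are all hypothetical. The caption also says only that training uses a flow-matching objective; the linear-interpolation form shown in `flow_matching_pair` is one common variant and is assumed here.

```python
import random
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[Vector]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Stand-in for learned weights (hypothetical; real weights are trained)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(x: Vector, weight: Matrix) -> Vector:
    """Bias-free linear map; weight has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def embed_tokens(patches: List[Vector], expr: Vector,
                 patch_w: Matrix, expr_w: Matrix) -> List[Vector]:
    """Project each patch feature, then add the embedded expression coefficients."""
    expr_emb = linear(expr, expr_w)
    return [[p + e for p, e in zip(linear(patch, patch_w), expr_emb)]
            for patch in patches]


def build_condition(vae_ref: List[Vector], shade_ref: List[Vector], expr_ref: Vector,
                    shade_drv: List[Vector], expr_drv: Vector,
                    weights: Dict[str, Matrix]) -> List[Vector]:
    # Reference branch: concatenate VAE latents and shading features per patch.
    ref_patches = [v + s for v, s in zip(vae_ref, shade_ref)]
    x_ref = embed_tokens(ref_patches, expr_ref,
                         weights["ref_patch"], weights["ref_expr"])
    # Driving branch: shading features only.
    x_drv = embed_tokens(shade_drv, expr_drv,
                         weights["drv_patch"], weights["drv_expr"])
    # Assumption: reference and driving tokens are joined along the sequence axis.
    return x_ref + x_drv


def flow_matching_pair(clean: Vector, noise: Vector, t: float) -> Tuple[Vector, Vector]:
    """Assumed linear-interpolation flow matching: noisy sample and velocity target."""
    x_t = [(1.0 - t) * n + t * c for c, n in zip(clean, noise)]
    target = [c - n for c, n in zip(clean, noise)]
    return x_t, target


# Toy dimensions, chosen only for the demonstration.
LATENT_DIM, SHADE_DIM, EXPR_DIM, MODEL_DIM = 4, 3, 5, 8
N_REF, N_DRV = 6, 6

rng = random.Random(0)
weights = {
    "ref_patch": random_matrix(MODEL_DIM, LATENT_DIM + SHADE_DIM, rng),
    "ref_expr": random_matrix(MODEL_DIM, EXPR_DIM, rng),
    "drv_patch": random_matrix(MODEL_DIM, SHADE_DIM, rng),
    "drv_expr": random_matrix(MODEL_DIM, EXPR_DIM, rng),
}


def rand_vec(dim: int) -> Vector:
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


condition = build_condition(
    vae_ref=[rand_vec(LATENT_DIM) for _ in range(N_REF)],
    shade_ref=[rand_vec(SHADE_DIM) for _ in range(N_REF)],
    expr_ref=rand_vec(EXPR_DIM),
    shade_drv=[rand_vec(SHADE_DIM) for _ in range(N_DRV)],
    expr_drv=rand_vec(EXPR_DIM),
    weights=weights,
)
print(len(condition), len(condition[0]))
```

In this sketch the backbone would receive `condition` alongside the noisy video latents, and a training step would regress the backbone output onto the `target` returned by `flow_matching_pair`. Adding the expression embedding to every patch token mirrors the caption's "summed with the embedded expression coefficients"; how the actual model broadcasts per-frame coefficients is not specified in the text above.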