TalkingPose: Efficient Face and Gesture Animation with Feedback-guided Diffusion Model

Alireza Javanmardi, Pragati Jaiswal, Tewodros Amberbir Habtegebrial, Christen Millerdurai, Shaoxiang Wang, Alain Pagani, Didier Stricker

TL;DR

The paper addresses the challenge of generating long-form, temporally coherent upper-body animations from a single image by introducing TalkingPose, a diffusion-based framework that employs a closed-loop control mechanism during inference to stabilize frame-to-frame consistency without extra training. It fuses source appearance with driving motion through a latent-diffusion pipeline using an Appearance Encoder (CLIP + ReferenceNet) and a Motion Encoder, with training on frame pairs to avoid heavy video-stack requirements. A large-scale TalkingPose Dataset of ~18K upper-body videos is released, and experiments on TED-talk, TikTok, and TalkingPose demonstrate state-of-the-art temporal coherence and appearance preservation, while achieving higher efficiency than temporal-layer-reliant baselines. These contributions enable robust, scalable, and unlimited-duration character animation suitable for virtual communication, entertainment, and sign-language contexts.

Abstract

Recent advancements in diffusion models have significantly improved the realism and generalizability of character-driven animation, enabling the synthesis of high-quality motion from just a single RGB image and a set of driving poses. Nevertheless, generating temporally coherent long-form content remains challenging. Existing approaches are constrained by computational and memory limitations, as they are typically trained on short video segments, thus performing effectively only over limited frame lengths and hindering their potential for extended coherent generation. To address these constraints, we propose TalkingPose, a novel diffusion-based framework specifically designed for producing long-form, temporally consistent human upper-body animations. TalkingPose leverages driving frames to precisely capture expressive facial and hand movements, transferring these seamlessly to a target actor through a stable diffusion backbone. To ensure continuous motion and enhance temporal coherence, we introduce a feedback-driven mechanism built upon image-based diffusion models. Notably, this mechanism does not incur additional computational costs or require secondary training stages, enabling the generation of animations with unlimited duration. Additionally, we introduce a comprehensive, large-scale dataset to serve as a new benchmark for human upper-body animation.

Paper Structure

This paper contains 22 sections, 4 equations, 19 figures, 9 tables, 1 algorithm.

Figures (19)

  • Figure 1: TalkingPose provides limitless face and gesture animation, while achieving the highest efficiency among current video diffusion approaches. Here, we animate the source characters (single frame input) based on the driving motions.
  • Figure 2: TalkingPose Pipeline. Training: Following AnimateAnyone [hu2024animate], the Appearance Encoder (CLIP + ReferenceNet) obtains source features, while the Motion Encoder (driving pose extraction using the method of Yang et al. [yang2023effective] + Pose Encoder) prepares motion cues for the U-Net. Inference: A single RGB source image and a driving pose condition drive the DDIM steps to predict a latent, which is refined via a feedback loop with proportional gain ($\beta$).
  • Figure 3: Qualitative Comparison. This figure shows our model's ability to accurately capture facial expressions, hand gestures, and poses across diverse postures, while preserving the appearance and background of the reference frame, compared to state-of-the-art methods AnimateAnyone [hu2024animate], MagicAnimate [xu2024magicanimate], Champ [zhu2024champ], MimicMotion [mimicmotion2024], and StableAnimator [tu2024stableanimator] on TED-talk [siarohin2021motion] and TalkingPose.
  • Figure 4: Qualitative Comparison. This figure demonstrates our model's performance compared to state-of-the-art methods AnimateAnyone [hu2024animate], MagicAnimate [xu2024magicanimate], Champ [zhu2024champ], MimicMotion [mimicmotion2024], and StableAnimator [tu2024stableanimator] on the TikTok dataset [jafarian2021learning].
  • Figure 5: Ablation Study on Temporal Analysis. Three sample frames from generated videos under (1) baseline without CLC, (2) with motion module, and (3) our CLC method. Red boxes mark artifacts or temporal errors vs. the reference.
  • ...and 14 more figures
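
The Figure 2 caption mentions a feedback loop with proportional gain ($\beta$) that refines each predicted latent. The sketch below is only a toy illustration of what a proportional correction between consecutive frames could look like. The update rule, the function names `proportional_feedback` and `animate`, and the default `beta` value are all assumptions made for illustration; this section does not give the paper's actual formulation, which operates inside the DDIM sampling steps.

```python
import numpy as np

def proportional_feedback(predicted, previous, beta):
    """Hypothetical proportional correction (not the paper's exact rule).

    The error is the gap between the previous frame's latent and the
    newly predicted latent; a fraction beta of it is fed back.
    """
    error = previous - predicted
    return predicted + beta * error

def animate(predicted_latents, beta=0.3):
    """Refine a sequence of per-frame latents one frame at a time.

    Only the previous refined latent is kept in memory, so the loop can
    run for an arbitrary number of frames.
    """
    refined = []
    previous = None
    for predicted in predicted_latents:
        if previous is None:
            current = predicted  # first frame has nothing to feed back
        else:
            current = proportional_feedback(predicted, previous, beta)
        refined.append(current)
        previous = current
    return refined

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in latents; a real pipeline would obtain these from the denoiser.
    frames = [rng.normal(size=(4, 8, 8)) for _ in range(5)]
    smoothed = animate(frames, beta=0.3)
    print(len(smoothed), smoothed[0].shape)
```

In this toy version, `beta = 0` leaves every prediction untouched and `beta = 1` freezes the output at the previous frame, so intermediate values trade responsiveness to the driving motion against frame-to-frame stability.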