Zero-shot High-fidelity and Pose-controllable Character Animation

Bingwen Zhu, Fanyi Wang, Tianyi Lu, Peng Liu, Jingwen Su, Jinxiu Liu, Yanhao Zhang, Zuxuan Wu, Guo-Jun Qi, Yu-Gang Jiang

TL;DR

This work tackles zero-shot image-to-video character animation from a single image, addressing the dual challenges of high visual fidelity and strict pose control without model training. It introduces PoseAnimate, a reconstruction-based framework equipped with four key innovations: PACM for pose-aware embeddings, DCAM for maintaining identity and temporal coherence, MGDM for decoupled character-background attention, and PATA for smooth pose transitions. Across extensive experiments, PoseAnimate surpasses state-of-the-art training-based methods in character consistency and detail fidelity while maintaining temporal coherence, validating its effectiveness and efficiency. By leveraging existing diffusion models with targeted modules, the approach enables high-quality, pose-controllable animations without requiring additional training data.

Abstract

Image-to-video (I2V) generation aims to create a video sequence from a single image, which requires high temporal coherence and visual fidelity. However, existing approaches suffer from inconsistent character appearance and poor preservation of fine details. Moreover, they require a large amount of video data for training, which can be computationally demanding. To address these limitations, we propose PoseAnimate, a novel zero-shot I2V framework for character animation. PoseAnimate contains three key components: 1) a Pose-Aware Control Module (PACM) that incorporates diverse pose signals into text embeddings to preserve character-independent content and maintain precise alignment of actions; 2) a Dual Consistency Attention Module (DCAM) that enhances temporal consistency and retains character identity and intricate background details; and 3) a Mask-Guided Decoupling Module (MGDM) that refines distinct feature perception abilities, improving animation fidelity by decoupling the character and background. We also propose a Pose Alignment Transition Algorithm (PATA) to ensure smooth action transitions. Extensive experimental results demonstrate that our approach outperforms state-of-the-art training-based methods in terms of character consistency and detail fidelity. Moreover, it maintains a high level of temporal coherence throughout the generated animations.
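
The abstract says PATA ensures smooth action transitions but does not spell out the procedure here. As a rough illustration of the idea only, the sketch below linearly interpolates 2D keypoints from the source pose to the first target pose and prepends those transition frames to the target sequence. The keypoint representation, the linear interpolation, and the names `interpolate_poses` and `build_pose_sequence` are all assumptions made for this example, not the paper's algorithm.

```python
from typing import List, Tuple

# Hypothetical pose representation: one (x, y) pair per keypoint.
Pose = List[Tuple[float, float]]


def interpolate_poses(source: Pose, target: Pose, num_steps: int) -> List[Pose]:
    """Return num_steps poses strictly between source and target (linear blend)."""
    if len(source) != len(target):
        raise ValueError("poses must have the same number of keypoints")
    frames: List[Pose] = []
    for step in range(1, num_steps + 1):
        t = step / (num_steps + 1)  # blend weight in (0, 1)
        frames.append([
            (sx + t * (tx - sx), sy + t * (ty - sy))
            for (sx, sy), (tx, ty) in zip(source, target)
        ])
    return frames


def build_pose_sequence(source: Pose, targets: List[Pose], num_steps: int) -> List[Pose]:
    """Source pose, then transition frames, then the target pose sequence."""
    if not targets:
        return [source]
    return [source] + interpolate_poses(source, targets[0], num_steps) + list(targets)


source_pose = [(0.0, 0.0), (4.0, 0.0)]
target_poses = [[(4.0, 4.0), (8.0, 4.0)], [(5.0, 5.0), (9.0, 5.0)]]
sequence = build_pose_sequence(source_pose, target_poses, num_steps=3)
```

With three transition steps, `sequence` holds six poses: the source pose, three blended poses, and the two target poses. A real system would likely also need to align scale and position between the source character and the target skeleton, which this sketch leaves out.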

Paper Structure (20 sections, 10 equations, 5 figures, 2 tables, 1 algorithm)

Figures (5)

  • Figure 1: The PoseAnimate framework generates smooth, high-quality character animations from static character images across various pose sequences.
  • Figure 2: Overview of PoseAnimate. The pipeline is shown on the left. We first utilize the Pose Alignment Transition Algorithm (PATA) to align the desired pose with a smooth transition to the target pose. We utilize the inversion noise of the source image as the starting point for generation. The optimized pose-aware embedding from PACM (Sec. 3.2) serves as the unconditional embedding input. The right side illustrates DCAM (Sec. 3.3). The attention block in this module consists of Dual Consistency Attention (DCA), Cross Attention (CA), and Feed-Forward Networks (FFN). Within DCA, we integrate MGDM to independently perform inter-frame attention fusion for the character and the background, which further enhances the fidelity of fine-grained details.
  • Figure 3: Illustration of Pose-Aware Control Module. Through two optimizations, the pose-aware embeddings are injected with motion awareness, which enables the alignment of generated actions with the target poses while maintaining consistency in character-independent scenes.
  • Figure 4: Qualitative comparison between our PoseAnimate and other training-based state-of-the-art character animation methods. We overlay the corresponding DensePose on the bottom right corner of the MagicAnimate (DensePose) synthesized frames. Previous methods suffer from inconsistent character appearance and lost details. Source prompts: "A firefighters in the smoke." (left), "A boy in the street." (right).
  • Figure 5: Visualization of ablation studies, with errors highlighted in red circles. Source prompt: "An iron man on the road."
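
The Figure 2 caption states that MGDM performs inter-frame attention fusion independently for the character and the background. One way to picture that is the minimal sketch below: each query attends to reference-frame keys twice, once restricted to character tokens and once to background tokens, and a per-query mask picks which result to keep. The token layout, the boolean masks, and the function names `masked_attention` and `decoupled_attention` are assumptions for illustration; the paper's actual fusion rule may differ.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention(queries: List[Vector], keys: List[Vector],
                     values: List[Vector], key_mask: List[bool]) -> List[Vector]:
    """Scaled dot-product attention restricted to keys where key_mask is True."""
    kept = [i for i, keep in enumerate(key_mask) if keep]
    if not kept:
        raise ValueError("mask selects no keys")
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs: List[Vector] = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, keys[i])) for i in kept]
        weights = softmax(scores)
        outputs.append([
            sum(w * values[i][d] for w, i in zip(weights, kept))
            for d in range(dim)
        ])
    return outputs


def decoupled_attention(queries: List[Vector], query_is_character: List[bool],
                        keys: List[Vector], values: List[Vector],
                        key_is_character: List[bool]) -> List[Vector]:
    """Attend to character and background keys separately, then merge by query mask."""
    char_out = masked_attention(queries, keys, values, key_is_character)
    bg_out = masked_attention(queries, keys, values,
                              [not m for m in key_is_character])
    return [c if is_char else b
            for c, b, is_char in zip(char_out, bg_out, query_is_character)]


# Toy example: token 0 belongs to the character, token 1 to the background.
fused = decoupled_attention(
    queries=[[1.0, 0.0], [0.0, 1.0]],
    query_is_character=[True, False],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 20.0]],
    key_is_character=[True, False],
)
```

In the toy example each region has a single key, so the character query receives the character value and the background query receives the background value, with no leakage between the two regions. That separation is the property the decoupling is meant to provide.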