IM-Animation: An Implicit Motion Representation for Identity-decoupled Character Animation

Zhufeng Xu, Xuan Gao, Feng-Lin Liu, Haoxian Zhang, Zhixue Fang, Yu-Kun Lai, Xiaoqiang Liu, Pengfei Wan, Lin Gao

TL;DR

IM-Animation tackles identity-decoupled character animation under cross-identity motion transfer by introducing a compact implicit motion representation and a mask-token retargeting bottleneck. It encodes per-frame dynamics into 1D motion tokens via a transformer-based encoder-decoder and a learnable codebook, while a mask-token retargeting module keeps the source image's pose from interfering with the transferred motion. A three-stage training pipeline progressively optimizes motion representation, retargeting, and diffusion-based video generation, enabling robust retargeting across substantial body-shape and pose differences. On cross-identity and self-reenactment benchmarks, IM-Animation achieves competitive or superior fidelity, identity preservation, and realism with limited computational resources.
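The summary above does not say how the learnable codebook is applied to the 1D motion tokens. As a rough illustration only, the sketch below assumes a standard nearest-neighbor vector-quantization lookup (in the style of VQ-VAE), which may differ from what the paper actually does. The names `quantize_tokens`, `frame_latents`, and `codebook`, along with the toy values, are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize_tokens(
    latents: Sequence[Vector], codebook: Sequence[Vector]
) -> Tuple[List[int], List[List[float]]]:
    """Map each continuous latent to its nearest codebook entry.

    Assumption: plain nearest-neighbor lookup; the paper's actual
    quantization scheme is not described in this summary.
    """
    indices: List[int] = []
    for latent in latents:
        # Pick the index of the codebook entry closest to this latent.
        best = min(
            range(len(codebook)),
            key=lambda k: squared_distance(latent, codebook[k]),
        )
        indices.append(best)
    # The quantized tokens are copies of the selected codebook entries.
    quantized = [list(codebook[i]) for i in indices]
    return indices, quantized


# Toy example: three 2-D latents for one frame, four codebook entries.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
frame_latents = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]

indices, quantized = quantize_tokens(frame_latents, codebook)
print(indices)    # [0, 3, 2]
print(quantized)  # [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
```

In this toy setup, downstream modules would only see the discrete indices or their codebook vectors rather than the raw latents. That is one common way such a bottleneck limits how much appearance detail can pass through, though whether IM-Animation relies on this exact mechanism is not stated here.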

Abstract

Recent progress in video diffusion models has markedly advanced character animation, which synthesizes videos by animating a static identity image according to a driving video. Explicit methods represent motion using skeletons, DWPose, or other explicit structured signals, but struggle to handle spatial mismatches and varying body scales. Implicit methods, on the other hand, capture high-level implicit motion semantics directly from the driving video, but suffer from identity leakage and entanglement between motion and appearance. To address these challenges, we propose a novel implicit motion representation that compresses per-frame motion into compact 1D motion tokens. This design relaxes the strict spatial constraints inherent in 2D representations and effectively prevents identity information leakage from the motion video. Furthermore, we design a temporally consistent mask token-based retargeting module that enforces a temporal training bottleneck, mitigating interference from the source image's motion and improving retargeting consistency. Our method employs a three-stage training strategy to improve training efficiency and ensure high fidelity. Extensive experiments demonstrate that our implicit motion representation and the proposed IM-Animation achieve superior or competitive performance compared with state-of-the-art methods.

Paper Structure (22 sections, 8 equations, 15 figures, 6 tables)

Figures (15)

  • Figure 1: IM-Animation introduces an implicit motion representation and retargeting method. Our model supports implicit motion control of a video model in cases with significant scale differences or substantial variations in posture and body shape.
  • Figure 2: We propose IM-Animation, an implicit portrait animation solution. Given an identity image and a motion video, we employ a three-stage training strategy. In the first stage, we train a compact motion encoder based on a 1D tokenizer. In the subsequent second and third stages, we train a temporal retargeting module based on mask tokens, utilizing a lightweight heatmap decoder for intermediate supervision. This approach ensures that we can encode precise retargeted information without leaking the identity information of the driving video or the pose information of the source image. Ultimately, we achieve end-to-end training of the entire model.
  • Figure 3: Method of Control Signal Injection in Video Model.
  • Figure 4: Qualitative Results. We compare our method with several state-of-the-art approaches, among which UniAnimate-DiT and Wan-Animate are trained using a 14B base model. In contrast, our method is faster while also achieving competitive results. Furthermore, our approach demonstrates strong performance in maintaining character identity and in retargeting across significant character differences.
  • Figure 5: Qualitative Results of Ablation Experiment. In each set, the images are arranged from left to right as follows: driving frame, character image, full model performance, and performance of the ablation experiment without the specified module.
  • ...and 10 more figures