IM-Animation: An Implicit Motion Representation for Identity-decoupled Character Animation
Zhufeng Xu, Xuan Gao, Feng-Lin Liu, Haoxian Zhang, Zhixue Fang, Yu-Kun Lai, Xiaoqiang Liu, Pengfei Wan, Lin Gao
TL;DR
IM-Animation tackles identity-decoupled character animation under cross-identity motion transfer by introducing a compact implicit motion representation and a mask-token retargeting bottleneck. It encodes per-frame dynamics into 1D motion tokens via a transformer-based encoder-decoder and a learnable codebook, which limits leakage of identity information from the driving video, while a temporally consistent mask-token module reduces interference from the motion present in the source image. A three-stage training pipeline progressively optimizes motion representation, retargeting, and diffusion-based video generation, enabling robust retargeting across substantial body-shape and pose differences. On cross-identity and self-reenactment benchmarks, IM-Animation achieves competitive or superior fidelity, identity preservation, and realism with limited computational resources.
Abstract
Recent progress in video diffusion models has markedly advanced character animation, which synthesizes videos by animating a static identity image according to a driving video. Explicit methods represent motion using skeletons, DWPose, or other explicit structured signals, but struggle to handle spatial mismatches and varying body scales. Implicit methods, on the other hand, capture high-level implicit motion semantics directly from the driving video, but suffer from identity leakage and entanglement between motion and appearance. To address these challenges, we propose a novel implicit motion representation that compresses per-frame motion into compact 1D motion tokens. This design relaxes the strict spatial constraints inherent in 2D representations and effectively prevents identity information leakage from the motion video. Furthermore, we design a temporally consistent mask token-based retargeting module that enforces a temporal training bottleneck, mitigating interference from the source image's motion and improving retargeting consistency. Our methodology employs a three-stage training strategy to enhance training efficiency and ensure high fidelity. Extensive experiments demonstrate that our implicit motion representation and the generative capabilities of the proposed IM-Animation achieve superior or competitive performance compared with state-of-the-art methods.
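
To make the idea of compact 1D motion tokens drawn from a learnable codebook concrete, the following is a minimal sketch of nearest-neighbor codebook lookup, the standard vector-quantization step. It is an illustration under stated assumptions, not the paper's implementation: the sizes DIM, NUM_CODES, and TOKENS_PER_FRAME, the function name quantize, and the random stand-ins for the encoder output and the codebook are all hypothetical, and the text does not specify how IM-Animation actually uses its codebook.

```python
import random


def quantize(tokens, codebook):
    """Return, for each token, the index of its nearest codebook entry
    under squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(t, codebook[k]))
        for t in tokens
    ]


random.seed(0)

# Hypothetical sizes, chosen only for illustration.
DIM, NUM_CODES, TOKENS_PER_FRAME = 4, 8, 3

# Stand-in for a learnable codebook (random here; learned in practice).
codebook = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_CODES)]

# Stand-in for the encoder output for one driving frame:
# a short 1D sequence of token vectors with no 2D spatial layout.
frame_tokens = [
    [random.gauss(0, 1) for _ in range(DIM)] for _ in range(TOKENS_PER_FRAME)
]

indices = quantize(frame_tokens, codebook)    # discrete motion codes
quantized = [codebook[i] for i in indices]    # vectors passed downstream
```

Because each frame is reduced to a handful of codebook indices with no spatial grid, such a representation has little capacity to carry pixel-aligned appearance, which is the intuition behind the claim that 1D tokens relax the spatial constraints of 2D representations.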
