One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer

Shijun Shi, Jing Xu, Zhihang Li, Chunli Peng, Xiaoda Yang, Lijing Lu, Kai Hu, Jiangning Zhang

TL;DR

This work introduces One-to-All Animation, an alignment-free framework for pose-driven character animation and image pose transfer that handles arbitrary reference layouts. It reframes training as self-supervised outpainting, leverages a Reference Extractor with Hybrid Reference Fusion Attention, and employs Identity-Robust Pose Control and a Token Replace strategy to achieve robust identity preservation and long-video coherence. The approach demonstrates strong quantitative and qualitative gains over state-of-the-art methods on multiple datasets and scales, enabling cross-scale image-video generation from a single reference. The contributions offer practical flexibility for real-world applications and broaden the scope of diffusion-based, pose-conditioned generation.

Abstract

Recent advances in diffusion models have greatly improved pose-driven character animation. However, existing methods are limited to spatially aligned reference-pose pairs with matched skeletal structures, and handling reference-pose misalignment remains unsolved. To address this, we present One-to-All Animation, a unified framework for high-fidelity character animation and image pose transfer from references with arbitrary layouts. First, to handle spatially misaligned references, we reformulate training as a self-supervised outpainting task that transforms diverse-layout references into a unified occluded-input format. Second, to process partially visible references, we design a reference extractor for comprehensive identity feature extraction. Further, we integrate hybrid reference fusion attention to handle varying resolutions and dynamic sequence lengths. Finally, to improve generation quality, we introduce identity-robust pose control, which decouples appearance from skeletal structure to mitigate pose overfitting, and a token replace strategy for coherent long-video generation. Extensive experiments show that our method outperforms existing approaches. The code and model are available at https://github.com/ssj9596/One-to-All-Animation.
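
To make the outpainting reformulation more concrete, the following is a minimal sketch of face-centered random masking, assuming the face is given as a bounding box and the visible region is a random rectangle that always contains it. The function name `face_centered_mask`, the box convention, and the uniform sampling are illustrative assumptions, not details taken from the paper or its released code.

```python
import random


def face_centered_mask(height, width, face_box, rng):
    """Sample a visibility mask whose visible rectangle always contains the face.

    face_box is (top, left, bottom, right) with half-open pixel ranges.
    Returns (mask, visible_box), where mask[r][c] is 1 for visible pixels
    and 0 for pixels the model would have to outpaint.
    This sampling scheme is an assumption made for illustration only.
    """
    top, left, bottom, right = face_box
    # Each edge of the visible rectangle is sampled between the image
    # border and the matching edge of the face box (inclusive bounds).
    vis_top = rng.randint(0, top)
    vis_left = rng.randint(0, left)
    vis_bottom = rng.randint(bottom, height)
    vis_right = rng.randint(right, width)
    mask = [
        [1 if vis_top <= r < vis_bottom and vis_left <= c < vis_right else 0
         for c in range(width)]
        for r in range(height)
    ]
    return mask, (vis_top, vis_left, vis_bottom, vis_right)


if __name__ == "__main__":
    mask, visible = face_centered_mask(16, 12, (4, 3, 8, 7), random.Random(0))
    print("visible box:", visible)
    for row in mask:
        print("".join("#" if v else "." for v in row))
```

Under this assumption, a head-only, half-body, or full-body reference all reduce to the same input format: a fixed-size canvas with a visible region around the face and a masked remainder to be generated.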

Paper Structure

This paper contains 24 sections, 14 equations, 16 figures, 5 tables.

Figures (16)

  • Figure 1: We introduce One-to-All Animation, a unified framework for pose-driven personalized generation. Unlike prior methods that require both spatially-aligned references and pose retargeting, our framework supports: (1) cross-scale video animation with either retargeted or original driving motion, (2) cross-scale image pose transfer, and (3) temporally coherent long video generation.
  • Figure 2: Visual comparison under spatially misaligned inputs. Previous methods such as MimicMotion and StableAnimator fail to preserve identity. Recent approaches including UniAnimate and Wan-Animate show degraded quality. Our method maintains robust appearance consistency.
  • Figure 3: Overview of the proposed framework. We introduce an outpainting preprocessing step that handles diverse body proportions through face-centered random masking during training and pose-guided translation at inference. The driving poses are encoded and refined via reference-guided pose control to preserve facial identity despite skeletal mismatch. Reference features are progressively injected through hybrid reference fusion attention, supporting variable resolutions and dynamic sequence lengths.
  • Figure 4: Face region enhancement. We sample a random pose, rescale it to match the driving pose, and apply their skeletal difference to create intentional facial skeleton inconsistency. A color-lighten signal indicates augmented poses.
  • Figure 5: At each denoising timestep, context tokens from the last segment (blue) are replaced into the current segment's initial latents, modulated with $t=0$ to serve as clean signals. This process ensures smooth transitions across segment boundaries.
  • ...and 11 more figures
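
The token replace procedure in the Figure 5 caption can be sketched as a toy loop. This is a simplified illustration under stated assumptions, not the authors' implementation: latents are plain floats rather than tensors, `toy_denoise_step` is a hypothetical stand-in for the diffusion model, and dropping the repeated context tokens when stitching segments is a choice made here for the example.

```python
import random


def toy_denoise_step(latents, timesteps):
    """Hypothetical denoiser: tokens with t == 0 are treated as clean and
    left unchanged; the others move toward the target value 1.0."""
    return [x if t == 0 else x + (1.0 - x) / t
            for x, t in zip(latents, timesteps)]


def token_replace_denoise(prev_segment, seg_len, ctx_len, num_steps,
                          denoise_step, rng):
    """Denoise one segment while pinning its first ctx_len tokens to the
    last ctx_len tokens of the previous segment."""
    use_ctx = bool(prev_segment) and ctx_len > 0
    context = prev_segment[-ctx_len:] if use_ctx else []
    latents = [rng.gauss(0.0, 1.0) for _ in range(seg_len)]
    for step in range(num_steps, 0, -1):
        # Replace the leading tokens with clean context at every step.
        latents[:len(context)] = context
        # Context tokens get timestep 0 so they act as clean signals.
        timesteps = [0] * len(context) + [step] * (seg_len - len(context))
        latents = denoise_step(latents, timesteps)
    latents[:len(context)] = context
    return latents


def generate_long(num_segments, seg_len, ctx_len, num_steps, seed=0):
    """Chain segments; repeated context tokens are dropped when stitching."""
    rng = random.Random(seed)
    video, prev = [], []
    for _ in range(num_segments):
        seg = token_replace_denoise(prev, seg_len, ctx_len, num_steps,
                                    toy_denoise_step, rng)
        video.extend(seg if not prev else seg[ctx_len:])
        prev = seg
    return video


if __name__ == "__main__":
    print(len(generate_long(num_segments=3, seg_len=8, ctx_len=2, num_steps=4)))
```

Because the context positions are overwritten before every denoising step and tagged with timestep 0, each new segment starts from tokens that exactly match the end of the previous one, which is what keeps the segment boundary continuous in this sketch.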