SCAIL: Towards Studio-Grade Character Animation via In-Context Learning of 3D-Consistent Pose Representations

Wenhao Yan, Sheng Ye, Zhuoyi Yang, Jiayan Teng, ZhenHui Dong, Kairui Wen, Xiaotao Gu, Yong-Jin Liu, Jie Tang

TL;DR

The paper tackles the challenge of studio-grade character animation under diverse, cross-domain conditions by introducing SCAIL, which combines a scalable 3D pose representation with cylindrical bones and a full-context pose injection mechanism within a diffusion-transformer framework. A dedicated data pipeline and Studio-Bench benchmark enable rigorous training and evaluation reflective of production requirements. Empirical results show state-of-the-art performance in both self-driven and cross-driven scenarios, with strong handling of multi-person interactions and occlusions. The work advances production-ready character animation by enabling robust motion transfer across varied figures and domains, while acknowledging limitations and ethical considerations surrounding realistic digital content.

Abstract

Achieving character animation that meets studio-grade production standards remains challenging despite recent progress. Existing approaches can transfer motion from a driving video to a reference image, but often fail to preserve structural fidelity and temporal consistency in in-the-wild scenarios involving complex motion and cross-identity animation. In this work, we present SCAIL (Studio-grade Character Animation via In-context Learning), a framework that addresses these challenges through two key innovations. First, we propose a novel 3D pose representation that provides a more robust and flexible motion signal. Second, we introduce a full-context pose injection mechanism within a diffusion-transformer architecture, enabling effective spatio-temporal reasoning over full motion sequences. To align with studio-level requirements, we develop a curated data pipeline that ensures both diversity and quality, and establish a comprehensive benchmark for systematic evaluation. Experiments show that SCAIL achieves state-of-the-art performance and advances character animation toward studio-grade reliability and realism.

Paper Structure

This paper contains 25 sections, 5 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: Overview of the proposed 3D-consistent pose. To implement scaling, we take the clavicle or the pelvis as the central reference and apply scaling from proximal to distal along each limb in the bone set $\mathcal{B}$. $\textit{Aug}(\cdot)$ denotes augmentation in training, $\textit{Ret}(\cdot)$ denotes retargeting in inference, and $\mathcal{P}^{\text{ref}} = \{ \text{P}^{\text{ref}}_j \mid 1 \leq j \leq N \}$ denotes the $N$ estimated 2D keypoints in the reference image. We further incorporate hand and face controls by overlaying 2D hand and face keypoints onto the rendered sequences, and align them with the projection of the 3D joints during augmentation or retargeting. For clarity, we omit the drawing process of the 2D hands and face in the figure.
  • Figure 2: Overview of SCAIL's model architecture. SCAIL builds upon an I2V model and incorporates pose control as an explicit context from which the model learns spatio-temporal motion. To accommodate the training setting, in which the reference image and the video input are sampled from different parts of the video, we modify the I2V model’s input structure by concatenating the reference image at the beginning of the sequence and initiating generation from $T=1$, using the original I2V pattern to inject the reference CLIP feature. To help the model better distinguish the conditional tokens from the noisy video sequence, we leverage the original mask mechanism of the Wan-I2V model architecture, applying an all-one mask to the reference image and the driving sequence, and an all-zero mask to the noisy video sequence.
  • Figure 3: Exploration of different strategies for pose injection.
  • Figure 4: The data curation pipeline. We perform character filtering and motion-speed filtering to construct high-quality training data.
  • Figure 5: User study comparing our model with popular community and commercial projects.
  • ...and 11 more figures
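
The proximal-to-distal scaling mentioned in the Figure 1 caption can be illustrated with a small sketch. This is not the paper's implementation: the function name `rescale_skeleton`, the parent-index encoding of the skeleton, and the per-bone scale dictionary are all assumptions made for illustration. The idea shown is only that each joint is re-placed relative to its already-rescaled parent, so a change to a proximal bone carries through to every joint further down the limb.

```python
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


def rescale_skeleton(
    joints: List[Vec3],
    parents: List[int],
    scales: Dict[int, float],
) -> List[Vec3]:
    """Rescale bone lengths from the root outward (hypothetical helper).

    joints  : 3D joint positions.
    parents : parents[j] is the parent joint of j, or -1 for the root
              (e.g. the pelvis). Parents must be listed before children.
    scales  : scale factor for the bone ending at joint j; bones that are
              not listed keep their original length.
    """
    out: List[Vec3] = []
    for j, p in enumerate(parents):
        if p < 0:
            # The central reference joint stays where it is.
            out.append(tuple(joints[j]))
            continue
        if p >= j:
            raise ValueError("parents must precede their children")
        s = scales.get(j, 1.0)
        # Original bone vector from parent to child.
        offset = tuple(c - q for c, q in zip(joints[j], joints[p]))
        # Attach the scaled bone to the parent's *new* position.
        out.append(tuple(q + s * o for q, o in zip(out[p], offset)))
    return out


# Toy leg chain: pelvis -> hip -> knee -> ankle (assumed layout).
joints = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, -2.0, 0.0), (1.0, -4.0, 0.0)]
parents = [-1, 0, 1, 2]
shortened = rescale_skeleton(joints, parents, {2: 0.5, 3: 0.5})
```

In this toy chain, halving the thigh and shin bones moves the knee up and moves the ankle with it, while bone directions are unchanged. Only bone lengths are adjusted here; how the scale factors are chosen for augmentation or retargeting is left to the paper.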