RealisDance-DiT: Simple yet Strong Baseline towards Controllable Character Animation in the Wild

Jingkai Zhou, Yifan Wu, Shikai Li, Min Wei, Chao Fan, Weihua Chen, Wei Jiang, Fan Wang

TL;DR

RealisDance-DiT tackles controllable character animation in open-world scenes by fine-tuning a powerful video foundation model (Wan-2.1) with minimal architectural changes and two practical fine-tuning strategies. It replaces heavy Reference Net designs with simple modifications to the foundation model, and uses low-noise warmup plus large-batch, small-iteration training to preserve the model's priors while converging quickly. The authors introduce RealisDance-Val, a challenging open-scene dataset, and report superior performance on the TikTok, UBC Fashion, and RealisDance-Val datasets, including strong qualitative results and a favorable user study. Overall, the work provides a lightweight, robust baseline that can guide future research on open-world character animation.

Abstract

Controllable character animation remains a challenging problem, particularly in handling rare poses, stylized characters, character-object interactions, complex illumination, and dynamic scenes. To tackle these issues, prior work has largely focused on injecting pose and appearance guidance via elaborate bypass networks, but often struggles to generalize to open-world scenarios. In this paper, we propose a new perspective: as long as the foundation model is powerful enough, straightforward model modifications with flexible fine-tuning strategies can largely address the above challenges, taking a step towards controllable character animation in the wild. Specifically, we introduce RealisDance-DiT, built upon the Wan-2.1 video foundation model. Our analysis reveals that the widely adopted Reference Net design is suboptimal for large-scale DiT models. Instead, we demonstrate that minimal modifications to the foundation model architecture yield a surprisingly strong baseline. We further propose the low-noise warmup and "large batches and small iterations" strategies to accelerate model convergence during fine-tuning while maximally preserving the priors of the foundation model. In addition, we introduce a new test dataset that captures diverse real-world challenges, complementing existing benchmarks such as the TikTok dataset and the UBC fashion video dataset, to comprehensively evaluate the proposed method. Extensive experiments show that RealisDance-DiT outperforms existing methods by a large margin.

Paper Structure

This paper contains 19 sections, 3 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Failure cases of existing methods. Existing methods sometimes generate a face in the silhouette frame, leave dumbbells suspended in the air when the woman squats down, generate artifacts when producing the yoga pose, and generate a realistic face for the comic character.
  • Figure 2: Illustration of architecture modifications and fine-tunable model parameters. The proposed RealisDance-DiT is fine-tuned under the final setting.
  • Figure 7: Illustration of spatially shifted RoPE for the reference latent.
  • Figure 8: Illustration of low-noise warmup strategy.
  • Figure 9: Visualization of frames generated by RealisDance-DiT. The images with orange borders are reference images. Zoom in for better visibility. Please refer to the supplementary materials for all videos.
  • ...and 4 more figures