AniSora: Exploring the Frontiers of Animation Video Generation in the Sora Era

Yudong Jiang, Baohan Xu, Siqian Yang, Mingyu Yin, Jing Liu, Chao Xu, Siqi Wang, Yidi Wu, Bingwen Zhu, Xinwen Zhang, Xingyu Zheng, Jixuan Xu, Yue Zhang, Jinlong Hou, Huyang Sun

TL;DR

AniSora addresses animation video generation by building a domain-specific pipeline that yields over 10M text-video pairs and a dedicated 948-video benchmark. It introduces a spatiotemporal diffusion framework based on a DiT backbone with a 3D Causal VAE latent, augmented by a Masked Diffusion Transformer and a Motion Area Condition to enable image-to-video generation, frame interpolation, and localized guidance. The model is initialized from CogVideoX and fine-tuned on animation data, with multi-task training and strategy-driven data curation to improve cross-style consistency. Comprehensive quantitative and human evaluations show strong gains in visual appearance and consistency over state-of-the-art methods, and the work provides public data and prompts to advance animation generation research. Limitations such as artifacts and flickering remain, with proposed future work including reinforcement learning approaches guided by the new benchmark.

Abstract

Animation has gained significant interest in the film and TV industry in recent years. Despite the success of advanced video generation models such as Sora, Kling, and CogVideoX in generating natural videos, they are less effective at handling animation videos. Evaluating animation video generation is also a great challenge due to its unique artistic styles, violations of the laws of physics, and exaggerated motions. In this paper, we present a comprehensive system, AniSora, designed for animation video generation, which includes a data processing pipeline, a controllable generation model, and an evaluation benchmark. Supported by the data processing pipeline, which yields over 10M high-quality data samples, the generation model incorporates a spatiotemporal mask module to facilitate key animation production functions such as image-to-video generation, frame interpolation, and localized image-guided animation. We also collect an evaluation benchmark of 948 diverse animation videos, with metrics developed specifically for animation video generation. Our entire project is publicly available at https://github.com/bilibili/Index-anisora/tree/main.
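To make the role of the spatiotemporal mask concrete, here is a minimal sketch of how a single binary mask over (frame, height, width) could express the three tasks named in the abstract. It is an illustration under stated assumptions, not the authors' implementation: the function name `build_guide_mask`, its parameters, and the toy tensor sizes are all hypothetical.

```python
import numpy as np

def build_guide_mask(num_frames, height, width, guided_frames, region=None):
    """Return a binary mask of shape (num_frames, height, width).

    1 marks positions where a guide image is supplied; 0 marks positions
    the model would have to generate. `region` is an optional
    (top, left, bottom, right) box used for localized guidance.
    All names here are hypothetical and chosen for illustration only.
    """
    mask = np.zeros((num_frames, height, width), dtype=np.float32)
    for t in guided_frames:
        if region is None:
            mask[t] = 1.0  # the whole frame is given
        else:
            top, left, bottom, right = region
            mask[t, top:bottom, left:right] = 1.0  # only a box is given
    return mask

# Image-to-video: only the first frame is given.
i2v = build_guide_mask(8, 4, 4, guided_frames=[0])
# Frame interpolation: the first and last frames are given.
interp = build_guide_mask(8, 4, 4, guided_frames=[0, 7])
# Localized guidance: only a 2x2 box of the first frame is given.
local = build_guide_mask(8, 4, 4, guided_frames=[0], region=(1, 1, 3, 3))
```

Under this assumed encoding the three tasks differ only in where the mask is set to 1, which is one way a single model could be trained on all of them at once.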

Paper Structure

This paper contains 36 sections, 8 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Overview. We propose AniSora, a comprehensive framework for animation video generation that integrates a high-quality animation dataset, a spatiotemporal conditional model, and a specialized animation video benchmark. The Data Processing Pipeline constructs a 10M video clip dataset derived from 1M diverse long animation videos. The Video Generation model employs a spatiotemporal conditional model, supporting various User Control and Interaction modes and enabling tasks such as frame interpolation, localized guidance, and so on. The benchmark set comprises 948 ground-truth videos spanning diverse styles, common motions, and both 2D and 3D animations. The prompt suite provides standardized prompts and guiding conditions, complemented by a Quantitative Evaluation with six objective metrics for assessing visual appearance and consistency. Additionally, Human Preference Evaluation confirms strong alignment with the proposed metrics. AniSora surpasses SOTA models, establishing a new benchmark for animation video generation.
  • Figure 2: Our method generates high-quality, highly consistent videos across various kinds of 2D/3D animation. These examples are generated under image-to-video settings, conditioned on the leftmost frame. Best viewed in color.
  • Figure 3: Method. This figure illustrates the Masked Diffusion Transformer framework for animation video generation, designed to support various spatiotemporal conditioning methods for precise and flexible animation control. A 3D Causal VAE compresses spatial-temporal features into a latent representation, generating the guide feature sequence $G$, while a reprojection network constructs the mask sequence $M$. These components, combined with noise and the prompt features, serve as input to the Diffusion Transformer. The transformer employs techniques such as patchify, 3D-RoPE embeddings, and 3D full attention to effectively capture spatial-temporal dependencies. This framework integrates keyframe interpolation, motion control, and mid-frame extension, simplifying animation production and enhancing creative possibilities.
  • Figure 4: Video generated by Opensora-V1.2 (top) and Vidu (bottom). The top video received a higher score despite containing noticeable distortions.
  • Figure 5: Human Evaluation and Benchmark Results Alignment
  • ...and 8 more figures
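
The Figure 3 caption says that the guide feature sequence $G$, the mask sequence $M$, and noise are combined and then patchified before entering the Diffusion Transformer. The sketch below shows one plausible way to do that with NumPy. The caption does not say how the inputs are combined, so the channel-wise concatenation, the function name `assemble_dit_input`, the patch size, and the tensor shapes are all assumptions made for illustration. Prompt features, 3D-RoPE, and attention are omitted.

```python
import numpy as np

def assemble_dit_input(noise, guide, mask, patch=2):
    """Combine noise, guide features G, and mask M into patch tokens.

    noise, guide: arrays of shape (C, T, H, W) in latent space.
    mask:         array of shape (1, T, H, W), 1 where a guide is given.
    Returns tokens of shape (T * H/patch * W/patch, (2C + 1) * patch**2).
    Channel-wise concatenation is an assumption, not the paper's stated design.
    """
    # Zero out guide features wherever no guide is supplied.
    x = np.concatenate([noise, guide * mask, mask], axis=0)  # (2C+1, T, H, W)
    c, t, h, w = x.shape
    assert h % patch == 0 and w % patch == 0, "H and W must divide by patch"
    # Split each frame into non-overlapping patch x patch squares.
    x = x.reshape(c, t, h // patch, patch, w // patch, patch)
    x = x.transpose(1, 2, 4, 0, 3, 5)  # (T, H/p, W/p, 2C+1, p, p)
    return x.reshape(t * (h // patch) * (w // patch), c * patch * patch)

# Toy example: 4 latent channels, 3 latent frames, 4x4 latent resolution.
rng = np.random.default_rng(0)
noise = rng.standard_normal((4, 3, 4, 4)).astype(np.float32)
guide = np.ones((4, 3, 4, 4), dtype=np.float32)
mask = np.zeros((1, 3, 4, 4), dtype=np.float32)
mask[:, 0] = 1.0  # only the first latent frame carries a guide
tokens = assemble_dit_input(noise, guide, mask)
```

In this toy setup each token holds the noise, masked guide, and mask values for one 2x2 spatial patch of one latent frame, so tokens from unguided frames carry zeros in their guide and mask channels.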