PTTA: A Pure Text-to-Animation Framework for High-Quality Creation

Ruiqi Chen, Kaitong Cai, Yijia Fan, Keze Wang

TL;DR

PTTA tackles the challenge of generating high-quality animation purely from text by adapting a large pretrained text-to-video model to animation styles. It addresses data scarcity with a self-constructed dataset of over 12,000 text–animation video pairs and adapts the model efficiently through HydraLoRA, a parameter-efficient fine-tuning scheme. Experimental results show that PTTA outperforms baselines on visual quality and text alignment while maintaining dynamic content, demonstrating the viability of purely text-driven animation generation. The work provides both a dataset and a scalable methodology for pure text-to-animation synthesis, with practical implications for rapid, cost-efficient animation creation.

Abstract

Traditional animation production involves complex pipelines and significant manual labor costs. While recent video generation models such as Sora, Kling, and CogVideoX achieve impressive results on natural video synthesis, they exhibit notable limitations when applied to animation generation. Recent efforts, such as AniSora, demonstrate promising performance by fine-tuning image-to-video models for animation styles, yet analogous exploration in the text-to-video setting remains limited. In this work, we present PTTA, a pure text-to-animation framework for high-quality animation creation. We first construct a small-scale but high-quality paired dataset of animation videos and textual descriptions. Building upon the pretrained text-to-video model HunyuanVideo, we perform fine-tuning to adapt it to animation-style generation. Extensive visual evaluations across multiple dimensions show that the proposed approach consistently outperforms comparable baselines in animation video synthesis.

Paper Structure

This paper contains 11 sections, 5 equations, 2 figures, 2 tables, 1 algorithm.

Figures (2)

  • Figure 1: Overview: We present PTTA, a pure text-conditioned animation video generation model built upon a self-constructed dataset of over 12,000 high-quality text-video pairs. The data processing pipeline is illustrated on the left of the figure. The model is obtained by efficiently fine-tuning HunyuanVideo with the HydraLoRA strategy (depicted by the multi-headed Hydra icon), which employs an asymmetric multi-branch LoRA framework to enhance both the model’s generalization and generation capabilities.
  • Figure 2: Prompt: anime girl, reading books, desk, gentle action, soft gaze, solo, full body, anime style
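
The "asymmetric multi-branch LoRA" named in the Figure 1 caption can be illustrated with a small sketch. It assumes the usual description of HydraLoRA: a single down-projection A shared by all branches, several up-projections B_k, and a softmax router that mixes the branches per input. The class name HydraLoRALinear, the rank, the number of heads, the initialisation scales, and the router form are all hypothetical choices made for illustration; the summary above does not give PTTA's actual configuration, and this is not the authors' implementation.

```python
import math
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    top = max(z)
    exps = [math.exp(t - top) for t in z]
    total = sum(exps)
    return [e / total for e in exps]


class HydraLoRALinear:
    """Frozen linear layer plus a shared-A, multi-B low-rank update (illustrative)."""

    def __init__(self, W0, rank=2, num_heads=3, scale=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W0), len(W0[0])
        self.W0 = W0  # pretrained weight, kept frozen
        # One down-projection A shared by every branch (rank x d_in).
        self.A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        # Several up-projections B_k (d_out x rank), zero-initialised so the
        # adapted layer starts out identical to the pretrained one.
        self.B = [[[0.0] * rank for _ in range(d_out)] for _ in range(num_heads)]
        # Assumed router: one score per branch, computed from the input.
        self.router = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(num_heads)]
        self.scale = scale

    def gates(self, x):
        """Per-branch mixing weights; they are positive and sum to 1."""
        return softmax(matvec(self.router, x))

    def forward(self, x):
        base = matvec(self.W0, x)   # frozen path: W0 x
        h = matvec(self.A, x)       # shared low-rank projection: A x
        delta = [0.0] * len(base)
        for g, B_k in zip(self.gates(x), self.B):
            for i, u in enumerate(matvec(B_k, h)):
                delta[i] += g * u   # gated sum of B_k (A x)
        return [b + self.scale * d for b, d in zip(base, delta)]


W0 = [[1.0, 0.0], [0.0, 1.0]]
layer = HydraLoRALinear(W0)
print(layer.forward([1.0, 2.0]))
```

Because every B_k starts at zero, the layer initially reproduces the frozen weight's output, and only A, the B_k, and the router would be trained. The asymmetry is that A is shared while the B_k are separate.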