Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation

Yunxin Li, Haoyuan Shi, Baotian Hu, Longyue Wang, Jiashun Zhu, Jinyi Xu, Zhen Zhao, Min Zhang

TL;DR

This work introduces Anim-Director, an autonomous agent powered by large multimodal models (LMMs) that generates long, coherent animation videos from concise narratives without task-specific training. GPT-4 directs a six-step workflow: story refinement, script generation, scene image generation, scene image improvement, video production, and video quality enhancement. It calls Midjourney to generate images and Pika to generate videos. Through self-reflection reasoning and cross-modal prompting, Anim-Director achieves better image coherence and video quality than contemporary baselines, as demonstrated on TinyStories with TaleCraft and VBench metrics. The approach shows that LMMs can act as end-to-end directors that automate complex creative workflows, lowering the barrier to animation production and reducing manual intervention.

Abstract

Traditional animation generation methods depend on training generative models with human-labelled data, entailing a sophisticated multi-stage pipeline that demands substantial human effort and incurs high training costs. Due to limited prompting plans, these methods typically produce brief, information-poor, and context-incoherent animations. To overcome these limitations and automate the animation process, we pioneer the introduction of large multimodal models (LMMs) as the core processor to build an autonomous animation-making agent, named Anim-Director. This agent mainly harnesses the advanced understanding and reasoning capabilities of LMMs and generative AI tools to create animated videos from concise narratives or simple instructions. Specifically, it operates in three main stages: Firstly, the Anim-Director generates a coherent storyline from user inputs, followed by a detailed director's script that encompasses settings of character profiles and interior/exterior descriptions, and context-coherent scene descriptions that include appearing characters, interiors or exteriors, and scene events. Secondly, we employ LMMs with the image generation tool to produce visual images of settings and scenes. These images are designed to maintain visual consistency across different scenes using a visual-language prompting method that combines scene descriptions and images of the appearing character and setting. Thirdly, scene images serve as the foundation for producing animated videos, with LMMs generating prompts to guide this process. The whole process is notably autonomous without manual intervention, as the LMMs interact seamlessly with generative tools to generate prompts, evaluate visual quality, and select the best one to optimize the final output.
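The visual-language prompting idea in the second stage can be illustrated with a minimal sketch. It assumes a simple in-memory representation of the director's script and a prompt made of reference images plus text. The names `Scene`, `DirectorScript`, and `build_scene_prompt` are hypothetical and are not taken from the paper, and the actual prompt format sent to the image generator is not specified here.

```python
from dataclasses import dataclass, field


@dataclass
class Scene:
    description: str       # the scene event, in text
    characters: list[str]  # names of the characters that appear
    setting: str           # name of the interior or exterior


@dataclass
class DirectorScript:
    character_profiles: dict[str, str]    # character name -> profile text
    setting_descriptions: dict[str, str]  # setting name -> description text
    scenes: list[Scene] = field(default_factory=list)


def build_scene_prompt(script, scene, reference_images):
    """Combine a scene's text with reference images of its characters and setting.

    `reference_images` maps a character or setting name to an image handle
    (a file path here, purely as an assumption for illustration).
    """
    names = scene.characters + [scene.setting]
    images = [reference_images[n] for n in names if n in reference_images]
    text_parts = [scene.description]
    text_parts += [script.character_profiles[c] for c in scene.characters]
    text_parts.append(script.setting_descriptions[scene.setting])
    return {"images": images, "text": " ".join(text_parts)}
```

Reusing the same reference images for every scene in which a character or setting appears is what this sketch relies on to keep scenes visually consistent; how the real system weights images against text is left open.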

Paper Structure

This paper contains 16 sections, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: The overall workflow of Anim-Director, which completes an animation video in six steps. GPT-4 acts as the director of this six-step automated process and interacts directly with the generative tools. To make generation controllable, an 'Image + Text → Scene Image/Video' approach is applied first. Designed enhancement steps then select the best images and videos from the candidates to ensure output quality.
  • Figure 2: A comparative case across models. For DPT-T2I and Custom Diffusion, generated images are shown for Scenes 1-5, 7-8, and 10. For our model, images are shown for all 10 scenes, including transition scenes 6 and 9. The comparison shows that our model produces more visually coherent, higher-quality images. The narrative is also converted to audio with Text-To-Speech (TTS) and synchronized with the generated video; the result is provided in the supplementary materials.
  • Figure 3: The extracted frames of videos generated by Pika.
  • Figure 4: A case illustrating a story featuring human and animal interactions.
  • Figure 5: An illustrative case depicting a story involving human interaction.
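
The candidate selection mentioned in the Figure 1 caption can be sketched as a generate-then-score loop. This is only an illustration under stated assumptions: `generate` stands in for a call to an image or video tool and `score` for an LMM-based quality judgment, and both are hypothetical callables supplied by the caller rather than real APIs. The number of candidates is likewise an arbitrary default.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def select_best(candidates: Sequence[T], score: Callable[[T], float]) -> T:
    """Return the candidate with the highest score (first one wins on ties)."""
    if not candidates:
        raise ValueError("no candidates to select from")
    return max(candidates, key=score)


def generate_and_select(
    generate: Callable[[int], T],
    score: Callable[[T], float],
    n_candidates: int = 4,  # assumed default, not taken from the paper
) -> T:
    """Produce several candidates for one scene and keep the best-scoring one."""
    candidates = [generate(i) for i in range(n_candidates)]
    return select_best(candidates, score)
```

The same loop applies to both images and videos, since only the `generate` and `score` callables change.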