AesopAgent: Agent-driven Evolutionary System on Story-to-Video Production

Jiuniu Wang, Zehua Du, Yuyuan Zhao, Bo Yuan, Kexiang Wang, Jian Liang, Yaxi Zhao, Yihen Lu, Gengliang Li, Junlong Gao, Xin Tu, Zhenyu Guo

TL;DR

AesopAgent tackles the challenge of turning user stories into engaging, coherent videos by coupling a Horizontal Layer that uses Retrieval-Augmented Generation to evolve workflows and prompts with a Utility Layer that provides modular, style- and character-consistent image and video generation utilities. The system introduces Knowledge-RAG (K-RAG) and Experience-RAG (E-RAG) to infuse professional knowledge and expert feedback into both workflow optimization and prompt design, while ensuring consistent visuals through image composition, character consistency, and style transfer. Experimental results show state-of-the-art visual storytelling performance against ComicAI and Artflow, with favorable comparisons to NUWA-XL and AutoStory, and user studies indicating higher fidelity and narrative coherence. The approach demonstrates a scalable, extensible pipeline for end-to-end story-to-video production, with potential to integrate further downstream AI tools (e.g., Runway Gen-2) and support greater customization across genres and characters.

Abstract

Agent and AIGC (Artificial Intelligence Generated Content) technologies have recently made significant progress. We propose AesopAgent, an Agent-driven Evolutionary System on Story-to-Video Production. AesopAgent is a practical application of agent technology to multimodal content generation. The system integrates multiple generative capabilities within a unified framework so that individual users can use these modules easily. It converts user story proposals into scripts, images, and audio, and then integrates these multimodal contents into videos. Additionally, animating units (e.g., Gen-2 and Sora) can make the videos more engaging. AesopAgent orchestrates the task workflow for video generation so that the generated video is both rich in content and coherent. The system contains two main layers: the Horizontal Layer and the Utility Layer. In the Horizontal Layer, we introduce a novel RAG-based evolutionary system that optimizes both the whole video generation workflow and the steps within it. It evolves continuously, iteratively optimizing the workflow by accumulating expert experience and professional knowledge, including optimizing LLM prompts and utility usage. The Utility Layer provides multiple utilities for consistent image generation that is visually coherent in composition, characters, and style. It also provides audio and special effects and integrates them into expressive, logically arranged videos. Overall, AesopAgent achieves state-of-the-art performance compared with many previous works in visual storytelling. AesopAgent is designed as a convenient service for individual users and is available at https://aesopai.github.io/.

Paper Structure (28 sections, 2 equations, 13 figures, 3 tables)

Figures (13)

  • Figure 1: Overview of our AesopAgent. The system converts the user story proposal into a video assembled from images, audio, narration, and special effects. The video generation workflow suggested by AesopAgent, which uses agent-based approaches and RAG techniques, encompasses script generation, image generation, and video assembly.
  • Figure 2: Illustration of the AesopAgent framework. The bottom part of the figure shows the workflow from the user story proposal to the video, and the top part shows the main components of our method: the Horizontal Layer and the Utility Layer. The Horizontal Layer leverages agent and RAG techniques to optimize the workflow, the prompts, and the usage of utilities. The Utility Layer provides the utilities for the image generation and video assembly steps.
  • Figure 3: The two stages of workflow creation and optimization. In the first stage, we interact with the agent, supported by E-RAG, to arrive at a feasible workflow scheme. In the second stage, once a feasible workflow scheme is available, we continue to optimize the workflow by combining the results of actual implementation with E-RAG.
  • Figure 4: Two types of RAG updating mechanisms. The left part shows the E-RAG updating process, using workflow optimization as an example. The agent summarizes expert experience and relevant experience in the E-RAG dataset, and writes the new experience to the E-RAG database to implement updates. The right part shows the updating process of K-RAG. The K-RAG database is updated by indexing and storing the knowledge documents provided by experts.
  • Figure 5: The overall workflow of story-to-video production. The workflow mainly contains three steps: script generation, image generation, and video assembly. The script generation step includes the title generation process and character design and selection process, followed by the action generation and shot generation process. The image generation step mainly includes the image generation process and image editing process. The video assembly step mainly includes the video material generation process and video editing process.
  • ...and 8 more figures
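
The two update paths described in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' implementation: `ToyStore`, `summarize`, `update_e_rag`, and `update_k_rag` are hypothetical names, retrieval is a naive keyword overlap rather than a real retriever, and the agent's summarization step (presumably an LLM call) is replaced by simple string concatenation.

```python
from __future__ import annotations

from dataclasses import dataclass, field


def _tokens(text: str) -> set[str]:
    """Lowercased word set used for a toy overlap-based retrieval."""
    return set(text.lower().split())


@dataclass
class ToyStore:
    """Hypothetical in-memory store standing in for a RAG database."""

    entries: list[str] = field(default_factory=list)

    def add(self, text: str) -> None:
        """'Index and store' one entry (here: just append it)."""
        self.entries.append(text)

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        """Return up to k entries sharing at least one word with the query."""
        q = _tokens(query)
        scored = [(len(q & _tokens(e)), e) for e in self.entries]
        scored = [(s, e) for s, e in scored if s > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [e for _, e in scored[:k]]


def summarize(expert_feedback: str, related: list[str]) -> str:
    """Assumed stand-in for the agent's summarization of new and old experience."""
    if not related:
        return expert_feedback
    return expert_feedback + " | related: " + "; ".join(related)


def update_e_rag(store: ToyStore, expert_feedback: str) -> str:
    """E-RAG path: retrieve relevant experience, summarize, write back."""
    related = store.retrieve(expert_feedback)
    new_experience = summarize(expert_feedback, related)
    store.add(new_experience)
    return new_experience


def update_k_rag(store: ToyStore, documents: list[str]) -> None:
    """K-RAG path: index and store expert-provided knowledge documents."""
    for doc in documents:
        store.add(doc)


if __name__ == "__main__":
    e_rag = ToyStore()
    e_rag.add("shot prompts should name the camera angle")
    print(update_e_rag(e_rag, "image prompts should name the art style"))

    k_rag = ToyStore()
    update_k_rag(k_rag, ["Guide to shot composition", "Guide to color grading"])
    print(len(k_rag.entries))
```

The sketch keeps the two paths separate on purpose: the E-RAG update reads from the store before writing to it, while the K-RAG update only ingests documents.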