AniMaker: Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation

Haoyuan Shi, Yunxin Li, Xinyu Chen, Longyue Wang, Baotian Hu, Min Zhang

TL;DR

AniMaker tackles the problem of producing coherent, long-form storytelling animations from text by introducing a four-agent pipeline that mirrors professional production: Director builds storyboards, Photography generates multi-candidate clips via an MCTS-Gen strategy, Reviewer applies context-aware AniEval scores to select and sequence clips, and Post-production assembles the final product with voiceovers and subtitles. The framework formalizes the end-to-end process as $\boldsymbol{\text{F}}: \boldsymbol{T}_{prompt} \rightarrow \boldsymbol{V}_{final}$ with intermediate representations and generative steps $G_K$ and $G_C$, enabling efficient best-of-N clip selection and cross-clip coherence. Central innovations are MCTS-Gen, which balances exploration and exploitation during clip generation, and AniEval, a comprehensive, context-aware evaluation framework for multi-shot storytelling animation. Empirical results on TinyStories demonstrate superior performance across VBench and AniEval, along with favorable human ratings, while ablations confirm the value of both components and the efficiency of the search, marking a meaningful step toward production-grade AI storytelling animation.
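
The four-stage flow described above can be pictured as a simple function composition. The following is a minimal sketch, not the authors' implementation: every name (`Shot`, `Clip`, `director`, `photography`, `reviewer`, `post_production`, `run_pipeline`) is hypothetical, the stages are toy stand-ins that manipulate strings rather than video, and the scoring function only sees preceding clips, whereas AniEval is described as also considering succeeding ones.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Shot:
    description: str  # one storyboard entry (hypothetical structure)


@dataclass
class Clip:
    shot: Shot
    candidate_id: int
    score: float = 0.0


def director(story: str) -> List[Shot]:
    # Toy stand-in for storyboard generation: one shot per sentence.
    return [Shot(s.strip()) for s in story.split(".") if s.strip()]


def photography(shot: Shot, n: int) -> List[Clip]:
    # Toy stand-in for multi-candidate clip generation.
    return [Clip(shot, i) for i in range(n)]


def reviewer(candidates: List[Clip], previous: List[Clip],
             score_fn: Callable[[Clip, List[Clip]], float]) -> Clip:
    # Score each candidate in the context of already selected clips,
    # then keep the highest-scoring one (best-of-N selection).
    for clip in candidates:
        clip.score = score_fn(clip, previous)
    return max(candidates, key=lambda clip: clip.score)


def post_production(clips: List[Clip]) -> List[str]:
    # Toy stand-in for assembly: return an ordered edit list.
    return [f"{i}: {c.shot.description} (candidate {c.candidate_id})"
            for i, c in enumerate(clips, start=1)]


def run_pipeline(story: str, n_candidates: int,
                 score_fn: Callable[[Clip, List[Clip]], float]) -> List[str]:
    selected: List[Clip] = []
    for shot in director(story):
        candidates = photography(shot, n_candidates)
        selected.append(reviewer(candidates, selected, score_fn))
    return post_production(selected)
```

Calling `run_pipeline(story, 3, score_fn)` with any scoring callable yields one selected candidate per storyboard shot, in story order.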

Abstract

Despite rapid advancements in video generation models, generating coherent storytelling videos that span multiple scenes and characters remains challenging. Current methods often rigidly convert pre-generated keyframes into fixed-length clips, resulting in disjointed narratives and pacing issues. Furthermore, the inherent instability of video generation models means that even a single low-quality clip can significantly degrade the entire output animation's logical coherence and visual continuity. To overcome these obstacles, we introduce AniMaker, a multi-agent framework enabling efficient multi-candidate clip generation and storytelling-aware clip selection, thus creating globally consistent and story-coherent animation solely from text input. The framework is structured around specialized agents, including the Director Agent for storyboard generation, the Photography Agent for video clip generation, the Reviewer Agent for evaluation, and the Post-Production Agent for editing and voiceover. Central to AniMaker's approach are two key technical components: MCTS-Gen in Photography Agent, an efficient Monte Carlo Tree Search (MCTS)-inspired strategy that intelligently navigates the candidate space to generate high-potential clips while optimizing resource usage; and AniEval in Reviewer Agent, the first framework specifically designed for multi-shot animation evaluation, which assesses critical aspects such as story-level consistency, action completion, and animation-specific features by considering each clip in the context of its preceding and succeeding clips. Experiments demonstrate that AniMaker achieves superior quality as measured by popular metrics including VBench and our proposed AniEval framework, while significantly improving the efficiency of multi-candidate generation, pushing AI-generated storytelling animation closer to production standards.
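
To make the exploration-exploitation idea behind an MCTS-inspired candidate search concrete, here is a deliberately simplified sketch. It is not the MCTS-Gen algorithm from the paper: it is a single-level, UCB1-style budget allocation, and the names `search_best_clip`, `generate`, `score`, and the constant `c` are assumptions made for illustration. The parameters `w1` and `w2` borrow the names used in the Figure 3 caption (initial candidate count and expansion iterations), but their exact role in the paper may differ.

```python
import math
from typing import Any, Callable, Optional, Tuple


def ucb1(mean: float, visits: int, total_visits: int, c: float) -> float:
    # Standard UCB1: average value plus an exploration bonus that
    # shrinks as a branch is sampled more often.
    return mean + c * math.sqrt(math.log(total_visits) / visits)


def search_best_clip(generate: Callable[[Optional[Any]], Any],
                     score: Callable[[Any], float],
                     w1: int, w2: int, c: float = 1.4) -> Tuple[Any, float]:
    """Spend w1 + w2 generations: w1 initial candidates, then w2
    expansions directed at the branches with the highest UCB1 value."""
    branches = []
    for _ in range(w1):
        cand = generate(None)  # fresh candidate (hypothetical generator)
        s = score(cand)
        branches.append({"seed": cand, "sum": s, "visits": 1,
                         "best": s, "best_cand": cand})

    for _ in range(w2):
        total = sum(b["visits"] for b in branches)
        branch = max(branches, key=lambda b: ucb1(b["sum"] / b["visits"],
                                                  b["visits"], total, c))
        child = generate(branch["seed"])  # new sample derived from this branch
        s = score(child)
        branch["visits"] += 1
        branch["sum"] += s
        if s > branch["best"]:
            branch["best"], branch["best_cand"] = s, child

    winner = max(branches, key=lambda b: b["best"])
    return winner["best_cand"], winner["best"]
```

Compared with flat best-of-N sampling, which would spread all `w1 + w2` generations uniformly, this kind of allocation concentrates later generations on branches whose earlier scores look promising while still occasionally revisiting less-sampled ones.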

Paper Structure

This paper contains 50 sections, 2 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: The overall architecture of our AniMaker framework. Given a story input, Director Agent creates detailed scripts and storyboards with reference images. Photography Agent generates candidate video clips using MCTS-Gen, which optimizes exploration-exploitation balance. Reviewer Agent evaluates clips with our AniEval assessment system. Post-production Agent assembles selected clips, adds voiceovers, and synchronizes audio with subtitles. This multi-agent system enables fully automated, high-quality animated storytelling.
  • Figure 2: Illustration of our MCTS-Gen strategy for efficient Best-of-N Sampling.
  • Figure 3: AniEval Score of Different $\mathbf{w_1}$ (initial candidate count) and $\mathbf{w_2}$ (expansion iterations) Combinations.
  • Figure 4: A comparative case showcasing AniMaker and models specialized in visual narratives. This figure illustrates the visualization of the short story of Tom and Lily. In the story, Tom brings a sack of toys to the town square, where he meets a sad girl named Lily who has no toys. Tom offers to share his toys, and the two children happily play together. Three models—StoryDiffusion, StoryAdapter, and AniMaker (ours)—are compared. AniMaker demonstrates superior narrative consistency, emotional expression, and character continuity across frames. It coherently depicts the extended action sequence of Tom picking up the sack, leaving his house, and arriving at the square. In contrast, while StoryDiffusion and StoryAdapter capture key moments from the story, they suffer from inconsistencies in visual coherence and character alignment, with mismatched character appearances highlighted by red boxes in the figure.
  • Figure 5: A comparative case showcasing AniMaker and models capable of generating storytelling videos. This figure visualizes the story of Sue, a little girl who tries to climb a big tree in the park but gets scared. Her friend Tom warns her to be careful, and she climbs down safely. Grateful, Sue hugs Tom, and they play on the swings together. The comparison includes MovieAgent, MMStoryAgent, VideoGen-of-Thought, and AniMaker (ours). AniMaker stands out with coherent scene progression, expressive character interactions, and consistent character identities. It clearly captures Sue’s emotional journey and key events—from climbing the tree and feeling afraid, to receiving help and having fun—demonstrating strong temporal and narrative alignment. In contrast, MovieAgent shows limited relevance to the input story, with inconsistent visuals and abstract content. VideoGen-of-Thought and MMStoryAgent follow the narrative more closely but still suffer from visual continuity issues, with character mismatches highlighted in red boxes.
  • ...and 8 more figures