Node-Based Editing for Multimodal Generation of Text, Audio, Image, and Video

Alexander Htet Kyaw, Lenin Ravindranath Sivalingam

TL;DR

The paper addresses controllable multimodal storytelling by representing narratives as node-based graphs where each node can generate text, audio, images, and video. A task-selection agent orchestrates a GPT-4.1-based pipeline (Generator, Reasoner, Diagrammer, Editor) to produce and refine a structured node graph and its JSON representation. Experiments on short 8–12 node stories show the system can generate linear and branching narratives with high success rates and supports targeted node and graph-level edits, plus full video export. While promising for accessible, iterative AI-assisted creation, the work notes limitations in scaling and cross-node coherence and points to future human-in-the-loop enhancements and better grounding for multimedia consistency.

Abstract

We present a node-based storytelling system for multimodal content generation. The system represents stories as graphs of nodes that can be expanded, edited, and iteratively refined through direct user edits and natural-language prompts. Each node can integrate text, images, audio, and video, allowing creators to compose multimodal narratives. A task selection agent routes between specialized generative tasks that handle story generation, node structure reasoning, node diagram formatting, and context generation. The interface supports targeted editing of individual nodes, automatic branching for parallel storylines, and node-based iterative refinement. Our results demonstrate that node-based editing supports control over narrative structure and iterative generation of text, images, audio, and video. We report quantitative outcomes on automatic story outline generation and qualitative observations of editing workflows. Finally, we discuss current limitations such as scalability to longer narratives and consistency across multiple nodes, and outline future work toward human-in-the-loop and user-centered creative AI tools.
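The abstract describes stories as graphs of nodes that support targeted edits and branching, backed by a JSON representation. The following is a minimal sketch of what such a structure could look like, not the authors' implementation. The names `StoryNode`, `StoryGraph`, `add_node`, `edit_node`, `storylines`, and `to_json`, the JSON layout, and the choice to clear media references after a text edit are all assumptions made for illustration. Media fields hold plain placeholder strings; no generation model is called.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Dict, List, Optional


@dataclass
class StoryNode:
    """One story beat. Field names are hypothetical, not taken from the paper."""
    node_id: str
    text: str
    # Placeholder references to generated assets (e.g. file paths); None = not generated.
    media: Dict[str, Optional[str]] = field(
        default_factory=lambda: {"image": None, "audio": None, "video": None}
    )
    children: List[str] = field(default_factory=list)


class StoryGraph:
    """A directed graph of story nodes; a node with several children is a branch point."""

    def __init__(self) -> None:
        self.nodes: Dict[str, StoryNode] = {}

    def add_node(self, node_id: str, text: str, parent: Optional[str] = None) -> StoryNode:
        if node_id in self.nodes:
            raise ValueError(f"duplicate node id: {node_id}")
        if parent is not None and parent not in self.nodes:
            raise KeyError(f"unknown parent: {parent}")
        node = StoryNode(node_id, text)
        self.nodes[node_id] = node
        if parent is not None:
            self.nodes[parent].children.append(node_id)
        return node

    def edit_node(self, node_id: str, new_text: str) -> None:
        """Targeted edit: only this node changes; its media is marked for regeneration."""
        node = self.nodes[node_id]
        node.text = new_text
        node.media = {kind: None for kind in node.media}  # assumption: edited text makes old media stale

    def storylines(self, root: str) -> List[List[str]]:
        """Enumerate every root-to-leaf path, i.e. each parallel storyline."""
        node = self.nodes[root]
        if not node.children:
            return [[root]]
        return [[root] + tail for child in node.children for tail in self.storylines(child)]

    def to_json(self) -> str:
        return json.dumps({"nodes": [asdict(n) for n in self.nodes.values()]}, indent=2)


if __name__ == "__main__":
    graph = StoryGraph()
    graph.add_node("n1", "Friends set out at dusk.")  # hypothetical story text
    graph.add_node("n2", "Friends meet the Blue Ghosts.", parent="n1")
    graph.add_node("n3a", "They follow the ghosts.", parent="n2")  # branch A
    graph.add_node("n3b", "They turn back.", parent="n2")  # branch B
    print(graph.storylines("n1"))
    print(graph.to_json())
```

Keeping child links as node identifiers rather than nested objects keeps the JSON flat, so a single node can be rewritten or regenerated without touching the rest of the graph.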

Paper Structure

This paper contains 14 sections, 17 figures, 3 tables.

Figures (17)

  • Figure 1: Conversational AI Interface integrated with Node-Based Multimedia Content Generation
  • Figure 2: System Overview: From User Input, Task Selection Agent, Generator, Reasoner, Diagrammer, Node Graph Representation, Editor, Context Generator to Multimedia Generation
  • Figure 3: Generating a Sequential Storyline versus a Branching Storyline
  • Figure 4: Using an LLM to edit selected nodes and make targeted changes to tone and story details
  • Figure 5: In this example, the “Friends meet the Blue Ghosts” node was regenerated with alternative descriptions, producing different video interpretations. The system displays both versions side by side, allowing the user to compare outcomes and select the preferred branch for continuing the narrative.
  • ...and 12 more figures