Script2Screen: Supporting Dialogue Scriptwriting with Interactive Audiovisual Generation

Zhecheng Wang, Jiaju Ma, Eitan Grinspun, Tovi Grossman, Bryan Wang

TL;DR

Script2Screen addresses the gap between textual scriptwriting and audiovisual production by integrating an AI-driven, interactive text-to-audiovisual pipeline. The approach leverages an LLM to annotate scripts and generate expressive speech and gesture-synchronized animation, delivered through a WYSIWYG UI that supports voice, gesture, and camera control. A formative study identifies the writing–production gap and the potential for audiovisual aids, while a user study with professionals and novices shows improvements in ideation, dialogue quality, and editing efficiency, albeit with some transparency and speed trade-offs. Overall, the work demonstrates a viable pathway for multimodal creativity tools that augment, rather than replace, human authors in script development and early storyboard ideation.

Abstract

Scriptwriting has traditionally been text-centric, a modality that only partially conveys the produced audiovisual experience. A formative study with professional writers informed us that connecting textual and audiovisual modalities can aid ideation and iteration, especially for writing dialogues. In this work, we present Script2Screen, an AI-assisted tool that integrates scriptwriting with audiovisual scene creation in a unified, synchronized workflow. Focusing on dialogues in scripts, Script2Screen generates expressive scenes with emotional speeches and animated characters through a novel text-to-audiovisual-scene pipeline. The user interface provides fine-grained controls, allowing writers to fine-tune audiovisual elements such as character gestures, speech emotions, and camera angles. A user study with both novice and professional writers from various domains demonstrated that Script2Screen's interactive audiovisual generation enhances the scriptwriting process, facilitating iterative refinement while complementing, rather than replacing, their creative efforts.

Paper Structure

This paper contains 63 sections, 9 figures, 1 table.

Figures (9)

  • Figure 1: Design space of audiovisual storytelling systems positioned by granularity of authoring control and expressiveness of output. We highlight one representative system per quadrant to reflect the most recent advances in each design space region. While prior tools focus on symbolic visualization or scene-level previsualization, Script2Screen uniquely supports fine-grained, expressive authoring across voice, gesture, and shot design.
  • Figure 2: Before generating the animation shown in Figure \ref{fig:teaser}.B, the system presents a preview stage where users review an automatically generated (1) title and (2) logline summarizing their script. They can then customize the scene by (3) selecting character names and choosing from a wide range of (4) voice options and (5) visual models. Once satisfied, users click Synthesis to begin the animation process, as detailed in Section \ref{sec:backend_system}.
  • Figure 3: In each dialogue card, when the user hovers over the text area, the system highlights the (1) emotion analysis and the (2) shot analysis to explain the rationale behind the generated animations. Users can (3) edit the speech content directly, as in a text editor, (4) play the audio, (5) reset it, and (6) choose from various speech emotion options such as "Agreement" or "Flirty" to guide vocal tone. The text bubble (7) supports click-and-edit functionality for seamless revision.
  • Figure 4: Overview of the Script2Screen text-to-audiovisual generation pipeline. The pipeline begins with the user's written script (1), which is processed by the text annotation module (2) using an LLM to parse and annotate the text, extracting information as outlined in Section \ref{sec:script_to_dialogue}. The annotated text is then sent to the audio generation module (3) to synthesize expressive speech audio. Both the speech and emotion tags are subsequently passed to the gesture generation module (4) to produce a motion capture file. This file is used in the animation retargeting module (5) to render an animated character, which is displayed in the interactive viewer (6) for user interaction. Based on the generated animation, users can then iterate on the script and initiate new generations as needed.
  • Figure 5: The text annotation module parses the script using Large Language Models (LLMs) to extract dialogue, speaker names, and other narrative elements (Section \ref{sec:script_to_dialogue}). The LLM identifies characters (1), semantically parses text to verbal speech (2), assigns emotion and style labels to dialogue lines (3), and selects shot types and camera angles for each line (4). The structured output is saved as a JSON object for use in generating synchronized audio and animations.
  • ...and 4 more figures
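
The Figure 5 caption describes the annotation output as a JSON object holding characters, spoken text, emotion labels, and shot choices for each line. As a rough illustration only, the sketch below shows one way such an object might be structured and handed to a downstream audio stage. The field names (`speaker`, `speech`, `emotion`, `shot_type`, `camera_angle`), the helper functions, and the sample values are all hypothetical; the summary does not specify the actual schema.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Tuple


@dataclass
class AnnotatedLine:
    """One dialogue line after annotation (hypothetical field names)."""
    speaker: str       # character identified by the LLM
    speech: str        # text converted to its spoken form
    emotion: str       # emotion/style label, e.g. "Flirty"
    shot_type: str     # assumed label such as "close-up"
    camera_angle: str  # assumed label such as "eye-level"


def to_annotation_json(characters: List[str], lines: List[AnnotatedLine]) -> str:
    """Serialize the annotations into a single JSON object."""
    payload = {"characters": characters, "lines": [asdict(line) for line in lines]}
    return json.dumps(payload, indent=2)


def speech_requests(annotation_json: str) -> List[Tuple[str, str, str]]:
    """Extract (speaker, speech, emotion) triples that an audio stage could consume."""
    doc = json.loads(annotation_json)
    return [(l["speaker"], l["speech"], l["emotion"]) for l in doc["lines"]]


if __name__ == "__main__":
    lines = [
        AnnotatedLine("Ava", "I knew you would come.", "Flirty", "close-up", "eye-level"),
        AnnotatedLine("Ben", "Fine, we do it your way.", "Agreement", "medium", "eye-level"),
    ]
    print(to_annotation_json(["Ava", "Ben"], lines))
```

Keeping shot and emotion labels on each line, rather than per scene, mirrors the per-line dialogue cards described in the Figure 3 caption, where each card can be edited and regenerated on its own.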