A Text-Native Interface for Generative Video Authoring

Xingyu Bruce Liu; Mira Dontcheva; Dingzeyu Li

TL;DR

This paper introduces Doki, a text-native interface for generative video authoring that aligns video creation with the natural process of writing text. The authors present it as a fundamental shift in generative video interfaces.

Abstract

Everyone can write their stories in freeform text format -- it's something we all learn in school. Yet storytelling via video requires one to learn specialized and complicated tools. In this paper, we introduce Doki, a text-native interface for generative video authoring, aligning video creation with the natural process of text writing. In Doki, writing text is the primary interaction: within a single document, users define assets, structure scenes, create shots, refine edits, and add audio. We articulate the design principles of this text-first approach and demonstrate Doki's capabilities through a series of examples. To evaluate its real-world use, we conducted a week-long deployment study with participants of varying expertise in video authoring. This work contributes a fundamental shift in generative video interfaces, demonstrating a powerful and accessible new way to craft visual stories.

Paper Structure (93 sections, 12 figures, 2 tables, 1 algorithm)

Introduction
Related Work
Generative Models as Individual Video Generators
Novel Video Authoring Interfaces
Subtractive vs. additive workflows.
Transcript-based editing.
Rich-Text Editing and Dynamic Documents
How People Create Generative Videos Today
Motivating Scenarios
Creator A: Asset-First
Creator B: Script-First
Creator C: Iterative-Exploratory
Recurring Challenges in Generative Video Authoring
Design Principles
Doki
...and 78 more sections

Figures (12)

Figure 1: A comparison of interface paradigms. (a) "Bento Box" style interfaces distribute authoring across multiple, separate representations. (b) Doki's approach uses a text-native canonical representation where the document serves as the primary interface.
Figure 2: Two basic example workflows in Doki. Alice: (1) define assets and shots with slash commands $\rightarrow$ (2) write story and generate previews $\rightarrow$ (3) create video shots; Bob: (a) prompt the sidebar agent for a draft $\rightarrow$ (b) review the AI-generated draft $\rightarrow$ (c) refine with inline AI agent.
Figure 3: Creating a shot in Doki. A new shot is inserted inline with a slash command, and the description that follows serves as its prompt. The system first generates a preview image, which can then be turned into a video clip. Users can click to expand them for playback and additional controls.
Figure 4: Writing consecutive shots within a single paragraph. Later shots inherit context from earlier ones, enabling continuity across a sequence. We achieve strong consistency between shots without repetitive context description in the Doki document.
Figure 5: Creating definitions in Doki. Users open the command menu with a /, select a type, and provide a name and description. Optionally, they can add a visual definition for even better consistency.
...and 7 more figures
