T2VTree: User-Centered Visual Analytics for Agent-Assisted Thought-to-Video Authoring

Zhuoyun Zheng; Yu Dong; Gaorong Liang; Guan Li; Guihua Shan; Shiyu Cheng; Dong Tian; Jianlong Zhou; Jie Liang

TL;DR

This work reframes thought-to-video authoring as a human–AI visual analytics problem and introduces T2VTree, a tree-based representation where each node captures a persistent authoring state that links intents, inputs, prompts, and multimodal outputs. It is augmented by agent-assisted planning that generates editable, node-bound plans before execution, enabling non-linear exploration with branch-aware comparison and in-context preview plus stitching for end-to-end assembly. The authors validate the approach with two cultural-heritage multi-scene case studies and a comparative user study against a node-based baseline, showing reduced coordination overhead, better reuse of intermediate results, and improved perceived control. The work demonstrates practical impact by enabling structured, repeatable, and scalable thought-to-video authoring workflows that preserve provenance across scenes and modalities, while maintaining user control over automation.

Abstract

Generative models have substantially expanded video generation capabilities, yet practical thought-to-video creation remains a multi-stage, multi-modal, and decision-intensive process. Existing tools either hide intermediate decisions behind repeated reruns or expose operator-level workflows that make exploration traces difficult to manage, compare, and reuse. We present T2VTree, a user-centered visual analytics approach for agent-assisted thought-to-video authoring. T2VTree represents the authoring process as a tree visualization. Each node in the tree binds an editable specification (intent, referenced inputs, workflow choice, prompts, and parameters) with the resulting multimodal outputs, making refinement, branching, and provenance inspection directly operable. To reduce the burden of deciding what to do next, a set of collaborating agents translates step-level intent into an executable plan that remains visible and user-editable before execution. We further implement T2VTreeVA, a visual analytics system that integrates branching authoring with in-place preview and stitching for convergent assembly, enabling end-to-end multi-scene creation without leaving the authoring context. We demonstrate T2VTreeVA through two multi-scene case studies and a comparative user study, showing how the T2VTree visualization and editable agent planning support reliable refinement, localized comparison, and practical reuse in real authoring workflows. T2VTree is available at: https://github.com/tezuka0210/T2VTree.
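To make the node description in the abstract concrete, the following is a minimal sketch of a tree node that binds an editable specification to its outputs and supports branching and provenance lookup. It is an illustration under stated assumptions, not the implementation in the T2VTree repository: the class name `AuthoringNode`, its field names, and the `branch` and `lineage` methods are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


# Hypothetical sketch; names and fields are assumptions, not the T2VTree API.
@dataclass(eq=False)  # compare nodes by identity, since parent/child links form cycles
class AuthoringNode:
    """One persisted authoring state: an editable spec plus its outputs."""
    intent: str
    workflow: str = ""
    prompts: Dict[str, str] = field(default_factory=dict)
    parameters: Dict[str, object] = field(default_factory=dict)
    inputs: List[str] = field(default_factory=list)    # referenced input assets
    outputs: List[str] = field(default_factory=list)   # resulting media files
    parent: Optional["AuthoringNode"] = field(default=None, repr=False)
    children: List["AuthoringNode"] = field(default_factory=list, repr=False)

    def branch(self, intent: str, **spec) -> "AuthoringNode":
        """Create a child state; the parent state is left untouched."""
        child = AuthoringNode(intent=intent, parent=self, **spec)
        self.children.append(child)
        return child

    def lineage(self) -> List["AuthoringNode"]:
        """Return the provenance path from the root down to this node."""
        node, path = self, []
        while node is not None:
            path.append(node)
            node = node.parent
        return path[::-1]


# Usage: two sibling alternatives, then a refinement of the first one.
root = AuthoringNode(intent="Opening shot of the temple gate")
a = root.branch("Warmer lighting", prompts={"image": "temple gate at dusk"})
b = root.branch("Wider framing", prompts={"image": "temple gate, wide angle"})
clip = a.branch("Animate the still", workflow="image-to-video", inputs=["a.png"])
print([n.intent for n in clip.lineage()])
```

Because each child keeps a link to its parent and the parent is never mutated by a refinement, sibling branches stay comparable and any output can be traced back to the specification that produced it.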

Paper Structure (29 sections, 7 figures, 1 table)

Introduction
Related Work
Generative multi-modality authoring workflows
Human–AI Collaboration and Creative Visual Analytics
Formative Study and Design Requirements
Formative Study Design
Observed Workflow: Common Authoring Actions
Key Challenges
Design Requirements
Authoring Workflow with Agent Assistance
Creator-Recognizable Authoring Actions
Agent-Assisted Planning as Authoring Decisions
Persisted Authoring State for Traceability and Reuse
Visual Design of T2VTree
Design Rationale and Alternatives
...and 14 more sections

Figures (7)

Figure 1: User-centered authoring loop during video stitching. Natural-language intent is translated into an editable plan, executed into visual analytics, and curated for video output, enabling branch-and-revise iteration from state to state.
Figure 1: Quantitative comparison of authoring efficiency and exploration behavior between ComfyUI and T2VTreeVA.
Figure 2: Three authoring alternatives for structure-guided image generation, motivating the tree-based representation in T2VTree.
Figure 3: The visual design of T2VTree. An intent-first authoring step is materialized as a Workflow Planning node that can be transformed into an executable modal color-coded node after agent planning. Solid arrows indicate the user-visible authoring trace after generation, while dashed arrows indicate system-side transformations from planning to modality-specific states.
Figure 4: The T2VTreeVA interface showcasing the authoring session for Case 1. (A) Control Panel used to input the global historical context and style settings. (B) T2VTree View visualizing the branching provenance trace, where nodes represent generated image/video/audio and sibling branches show alternative processes and refinements. (C) Video Stitching View assembling selected clips from divergent branches into the final documentary timeline.
...and 2 more figures
