
DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing

Ke Li, Maoliang Li, Jialiang Chen, Jiayu Chen, Zihao Zheng, Shaoqi Wang, Xiang Chen

Abstract

Video mashup creation is a complex video editing paradigm that recomposes existing footage into engaging audio-visual experiences, demanding intricate orchestration across semantic, visual, and auditory dimensions at multiple levels. However, existing automated editing frameworks often overlook the cross-level multimodal orchestration required for professional-grade fluidity, producing disjointed sequences with abrupt visual transitions and musical misalignment. To address this, we formulate video mashup creation as a Multimodal Coherency Satisfaction Problem (MMCSP) and propose the DIRECT framework. Simulating a professional production pipeline, our hierarchical multi-agent framework decomposes the challenge into three cascaded levels: the Screenwriter for source-aware global structural anchoring, the Director for instantiating adaptive editing intent and guidance, and the Editor for intent-guided shot sequence editing with fine-grained optimization. We further introduce Mashup-Bench, a comprehensive benchmark with tailored metrics for visual continuity and auditory alignment. Extensive experiments demonstrate that DIRECT significantly outperforms state-of-the-art baselines in both objective metrics and human subjective evaluation. Project page and code: https://github.com/AK-DREAM/DIRECT
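The abstract describes a three-level cascade (Screenwriter → Director → Editor). The following is a minimal Python sketch of that decomposition; the interfaces, data fields, and hard-coded plan are illustrative assumptions, not DIRECT's actual implementation.

```python
# Hypothetical sketch of the Screenwriter -> Director -> Editor cascade.
# Class roles come from the paper; all interfaces and data are illustrative assumptions.
from dataclasses import dataclass
from typing import List


@dataclass
class SectionPlan:          # Screenwriter output: one entry per narrative/musical section
    theme: str
    start: float            # seconds into the music track
    end: float


@dataclass
class SegmentGuidance:      # Director output: query / heuristic / pacing per segment (cf. Figure 1)
    query: str              # what footage to retrieve
    heuristic: str          # e.g. "keep motion flow across cuts"
    pacing: float           # target shot length in seconds (assumed unit)


class Screenwriter:
    def plan(self, footage_summary: str, music_summary: str) -> List[SectionPlan]:
        # In DIRECT this is an agent doing source-aware global structural anchoring;
        # here we return a fixed two-section plan as a stand-in.
        return [SectionPlan("build-up", 0.0, 15.0), SectionPlan("climax", 15.0, 30.0)]


class Director:
    def instantiate(self, plan: List[SectionPlan]) -> List[SegmentGuidance]:
        # Expands the global plan into segment-level editing guidance.
        return [SegmentGuidance(f"shots matching '{p.theme}'", "keep motion flow across cuts", 2.0)
                for p in plan]


class Editor:
    def edit(self, guidance: SegmentGuidance, shots: List[str]) -> List[str]:
        # Intent-guided shot selection; the real Editor performs retrieval plus sequence optimization.
        theme = guidance.query.split("'")[1]
        return [s for s in shots if theme in s][:3]


if __name__ == "__main__":
    shots = ["build-up_a", "build-up_b", "climax_a", "climax_b"]
    plan = Screenwriter().plan("footage summary", "music summary")
    timeline = [s for g in Director().instantiate(plan) for s in Editor().edit(g, shots)]
    print(timeline)  # ['build-up_a', 'build-up_b', 'climax_a', 'climax_b']
```

The intent of the cascade is that each level narrows scope: the Screenwriter fixes the global structure, the Director turns it into per-segment intent, and the Editor only ever optimizes within one segment's guidance.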



Figures (10)

  • Figure 1: Overview of DIRECT. We decompose video mashup creation into three collaborative modules: the Screenwriter anchors the global structure to align the multimodal content; the Director instantiates segment-level guidance (query, heuristic, pacing); and the Editor executes shot retrieval and orchestration following the editing guidance with closed-loop validation.
  • Figure 2: Visualization of the hierarchical planning workflow in DIRECT. The Screenwriter leverages multimodal source analysis to generate a section-wise global structural plan, and the Director expands it into segment-level editing guidance.
  • Figure 3: Intent-Guided Shot Sequence Editing. The Editor uses a tailored beam search algorithm with dynamic sliding-window trimming to find optimal shot sequences (see the sketch after this list).
  • Figure 4: Qualitative Comparison of Low-Level Coherency. While the baseline (top row) only ensures semantic relevance, our method achieves superior visual continuity (matched subject position and motion flow across transitions) and auditory alignment (visual cut points synchronized with musical beats, indicated by green crests).
  • Figure 5: Case study of Footage Summarization. It deconstructs the expansive footage library by clustering semantically related shots into distinct groups and captioning each group.
  • ...and 5 more figures
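
Figure 3 describes the Editor's shot sequencing as a tailored beam search with dynamic sliding-window trimming. The sketch below shows a generic beam search over candidate shots to convey the idea; the scoring function, the simplified trimming rule (keeping only the top-scoring beams at each step), and all names are placeholders rather than the paper's algorithm.

```python
# Generic beam-search sketch for shot sequence selection. The score function and
# the trimming rule are illustrative placeholders, not DIRECT's actual method.
import heapq
from typing import Callable, List, Sequence, Tuple


def beam_search_shots(
    candidates: Sequence[str],
    seq_len: int,
    score: Callable[[List[str], str], float],
    beam_width: int = 3,
) -> List[str]:
    """Grow shot sequences step by step, keeping only the top `beam_width` partial sequences."""
    beams: List[Tuple[float, List[str]]] = [(0.0, [])]
    for _ in range(seq_len):
        expanded = [
            (s + score(seq, c), seq + [c])
            for s, seq in beams
            for c in candidates
            if c not in seq                      # no repeated shots within one sequence
        ]
        # Simplified "trimming": keep only the highest-scoring beams before the next step.
        beams = heapq.nlargest(beam_width, expanded, key=lambda b: b[0])
    return max(beams, key=lambda b: b[0])[1]


if __name__ == "__main__":
    shots = ["wide_city", "close_face", "mid_run", "close_hand"]

    # Toy score: reward transitions that keep the same framing prefix (wide/close/mid).
    def toy_score(seq: List[str], nxt: str) -> float:
        return 1.0 if seq and seq[-1].split("_")[0] == nxt.split("_")[0] else 0.5

    print(beam_search_shots(shots, seq_len=3, score=toy_score))
```

Presumably, in DIRECT the score reflects the Director's guidance (query relevance, visual continuity, pacing), and the dynamic sliding-window trimming additionally adjusts shot in/out points, e.g. so that cuts land on musical beats as shown in Figure 4; those details are not reproduced here.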