CutClaw: Agentic Hours-Long Video Editing via Music Synchronization

Shifang Zhao, Yihan Hu, Ying Shan, Yunchao Wei, Xiaodong Cun

Abstract

Editing video content in alignment with music has become a popular form of human-made digital art on social media. However, the time-consuming and repetitive nature of manual video editing has long been a challenge for filmmakers and professional content creators alike. In this paper, we introduce CutClaw, an autonomous multi-agent framework that edits hours-long raw footage into meaningful short videos by leveraging multiple Multimodal Language Models (MLLMs) as an agent system. It produces videos that are synchronized to the music, follow the given instructions, and are visually appealing. In detail, our approach begins with a hierarchical multimodal decomposition that captures both fine-grained details and global structure across the visual and audio footage. Then, to ensure narrative consistency, a Playwriter Agent orchestrates the overall storytelling flow and structures the long-term narrative, anchoring visual scenes to musical shifts. Finally, to construct the short edited video, Editor and Reviewer Agents collaboratively optimize the final cut by selecting fine-grained visual content according to rigorous aesthetic and semantic criteria. Detailed experiments demonstrate that CutClaw significantly outperforms state-of-the-art baselines in generating high-quality, rhythm-aligned videos. The code is available at: https://github.com/GVCLab/CutClaw.
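To make the three-stage loop described above concrete, here is a minimal Python sketch of a decompose-plan-edit-review pipeline. This is an illustration only, not the authors' implementation: all class names, method signatures, and the stub agent logic are hypothetical placeholders for MLLM-backed agents.

```python
# Hypothetical sketch of a CutClaw-style agent loop; all names and agent
# internals are placeholders, not the authors' code.
from dataclasses import dataclass

@dataclass
class Shot:
    start: float   # seconds into the raw footage
    end: float
    caption: str   # MLLM-generated shot description

@dataclass
class Verdict:
    accepted: bool
    feedback: str = ""

class StubAgent:
    """Stand-in for the MLLM-backed agents; replace with real model calls."""
    def plan(self, shots, beats, instruction):
        # Playwriter: anchor one planned shot to each musical shift.
        return list(zip(beats, shots))
    def select(self, plan, shots):
        # Editor: pick a candidate segment for every planned beat.
        return [shot for _, shot in plan]
    def validate(self, cut, plan):
        # Reviewer: accept once every planned beat is covered.
        return Verdict(accepted=len(cut) == len(plan))
    def revise(self, cut, feedback):
        # Editor: incorporate the Reviewer's critique (no-op in this stub).
        return cut

def cutclaw_pipeline(shots, beats, instruction, agent=StubAgent(), max_rounds=3):
    plan = agent.plan(shots, beats, instruction)   # Playwriter
    cut = agent.select(plan, shots)                # Editor
    for _ in range(max_rounds):                    # Editor-Reviewer loop
        verdict = agent.validate(cut, plan)        # Reviewer
        if verdict.accepted:
            break
        cut = agent.revise(cut, verdict.feedback)
    return cut
```

The bounded Editor-Reviewer loop reflects the collaborative optimization the abstract describes: the Reviewer's critiques feed back into the Editor until the cut passes validation or a round limit is hit.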

Paper Structure

This paper contains 23 sections, 6 equations, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: We present an automated music-driven video editing system that transforms hours-long footage into high-quality short videos based on user instructions and the rhythm of a given music track, so that the resulting video demonstrates precise music synchronization, faithful instruction following, and visually appealing aesthetics.
  • Figure 2: The overall workflow of CutClaw. The multimodal footage is first deconstructed; then a shot plan is generated by the Playwriter, scenes are retrieved and edited by the Editor, and quality is validated by the Reviewer.
  • Figure 3: Left: Video Shots Aggregation. We first perform shot detection on the entire video and caption each shot for a detailed understanding. Then, we use an LLM to aggregate similar content into a scene-level description (see the sketch after this list). Right: The workflow of the Playwriter. The Playwriter generates the whole storyline according to the input and gives the detailed shot plan for each specific shot of the music.
  • Figure 4: The Editor and Reviewer perform segment selection and validation. SNR stands for Semantic Neighborhood Retrieval, and FGST stands for Fine-Grained Shot Trimming.
  • Figure 5: A sample execution of a single-shot cutting, utilizing footage from the movie "Interstellar" and the music "Moon." Actions performed by the Playwriter, Editor, and Reviewer are color-coded in blue, yellow, and green, respectively. The orange background traces the execution path leading to the final clip selection.
  • ...and 1 more figure
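
The shot-aggregation step in the left half of Figure 3 (shot detection, per-shot captioning, then LLM aggregation into scene-level descriptions) can be approximated with off-the-shelf tools. Below is a minimal sketch using PySceneDetect's documented detection API; `caption_shot` and `aggregate_scenes` are hypothetical stand-ins for the MLLM/LLM calls, not functions from the paper.

```python
# Sketch of the shot-aggregation step in Figure 3 (left). Shot detection
# uses PySceneDetect's real API; caption_shot() and aggregate_scenes()
# are hypothetical stand-ins for the MLLM/LLM calls.
from scenedetect import detect, ContentDetector

def caption_shot(video_path: str, start_s: float, end_s: float) -> str:
    # Placeholder: would send the shot's frames to an MLLM for captioning.
    return f"shot {start_s:.1f}-{end_s:.1f}s"

def aggregate_scenes(captions: list[str]) -> list[str]:
    # Placeholder: would prompt an LLM to merge similar consecutive
    # shot captions into scene-level descriptions.
    return captions

def build_scene_descriptions(video_path: str) -> list[str]:
    # 1. Detect shot boundaries with a content-change detector.
    shots = detect(video_path, ContentDetector())
    # 2. Caption every detected shot individually.
    captions = [caption_shot(video_path, s.get_seconds(), e.get_seconds())
                for s, e in shots]
    # 3. Aggregate similar shots into scene-level descriptions.
    return aggregate_scenes(captions)
```

The hierarchical structure (shot-level captions rolled up into scene-level descriptions) is what lets the downstream Playwriter reason over hours of footage without exceeding the context budget of a single MLLM call.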