AutoCut: End-to-end advertisement video editing based on multimodal discretization and controllable generation

Milton Zhou, Sizhong Qin, Yongzhi Li, Quan Chen, Peng Jiang

Abstract

Short-form videos have become a primary medium for digital advertising, requiring scalable and efficient content creation. However, current workflows and AI tools remain fragmented and modality-specific, leading to high production costs and low overall efficiency. To address this issue, we propose AutoCut, an end-to-end advertisement video editing framework based on multimodal discretization and controllable generation. AutoCut employs dedicated encoders to extract video and audio features, then applies residual vector quantization to discretize them into unified tokens aligned with textual representations, constructing a shared video-audio-text token space. Built on a foundation model, we further develop a multimodal large language model for video editing through combined multimodal alignment and supervised fine-tuning, supporting video selection and ordering, script generation, and background music selection within a unified editing framework. Finally, a complete production pipeline converts the predicted token sequences into deployable long-form video outputs. Experiments on real-world advertisement datasets show that AutoCut reduces production cost and iteration time while substantially improving consistency and controllability, paving the way for scalable video creation.
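The residual vector quantization (RVQ) step the abstract describes can be illustrated with a minimal sketch: each stage snaps the current residual to its nearest codeword and subtracts it, so one continuous video or audio feature vector becomes a short sequence of discrete code indices. Everything below (class name, stage count, codebook size, feature dimension) is an illustrative assumption rather than the authors' actual tokenizer, and training details such as straight-through gradients and codebook updates are omitted.

```python
import torch

class ResidualVQ(torch.nn.Module):
    """Hypothetical sketch of residual vector quantization (RVQ).

    Each stage quantizes the residual left by the previous stage, so a
    single feature vector yields `num_stages` discrete code indices.
    """

    def __init__(self, num_stages: int = 4, codebook_size: int = 1024, dim: int = 256):
        super().__init__()
        self.codebooks = torch.nn.ModuleList(
            torch.nn.Embedding(codebook_size, dim) for _ in range(num_stages)
        )

    @torch.no_grad()  # inference-only sketch; training would need a straight-through estimator
    def forward(self, x: torch.Tensor):  # x: (batch, dim) encoder features
        residual = x
        quantized = torch.zeros_like(x)
        codes = []
        for codebook in self.codebooks:
            dists = torch.cdist(residual, codebook.weight)  # (batch, codebook_size)
            idx = dists.argmin(dim=-1)                      # nearest codeword per vector
            q = codebook(idx)
            quantized = quantized + q
            residual = residual - q
            codes.append(idx)
        # `codes` are the discrete tokens that can share a vocabulary with text.
        return quantized, torch.stack(codes, dim=-1)  # (batch, dim), (batch, num_stages)
```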

Paper Structure

This paper contains 38 sections, 6 equations, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Overview of the AutoCut framework. Low-fps frames are tokenized for efficient multimodal reasoning, while high-fps frames are kept for accurate visual matching and clip retrieval. Given optional inputs, AutoCut predicts video, text, and audio tokens, which are decoded or retrieved to compose the final advertisement video.
  • Figure 2: Overview of the proposed AutoCut framework. Multimodal tokenization converts scripts, frames, and audio into unified discrete tokens, which are organized into an alignment input sequence (see the illustrative token-layout sketch after this list). Stage 1 performs multimodal alignment on large-scale data to align the added token embeddings with the LLM backbone. Stage 2 applies task-specific SFT for video selection, video sorting, script generation, and BGM selection.
  • Figure 3: Dataset statistics for the multimodal alignment (top) and SFT (bottom) stages. The alignment dataset is substantially larger but exhibits more diverse and irregular distributions, while the SFT dataset is smaller yet more balanced and of higher annotation quality.
  • Figure 4: User study results: win–loss ratios against GPT-4o across five evaluation dimensions.
  • Figure 5: Qualitative case studies of our automatic ad video editing pipeline. For each case (top: wireless earbuds, bottom: children’s face cream), we show the provided product information, selected video clips, and the aligned script sentences. Each frame strip visualizes the model’s clip selection and ordering, together with the corresponding sentence and clip-level timestamps, illustrating how the system produces a coherent, time-aligned ad from raw footage and product metadata.
  • ...and 2 more figures
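As noted in the Figure 2 caption, the alignment stage consumes one interleaved sequence of text, video, and audio tokens. A minimal sketch of how such a sequence might be assembled is shown below; the vocabulary offsets and delimiter ids are hypothetical placeholders, since this excerpt does not specify the actual token layout.

```python
from typing import List

# Hypothetical vocabulary layout: text ids occupy [0, 50000); RVQ codes for
# video and audio are shifted into reserved, non-overlapping id ranges.
VIDEO_OFFSET = 50_000
AUDIO_OFFSET = 60_000
BOV, EOV, BOA, EOA = 70_000, 70_001, 70_002, 70_003  # illustrative delimiter ids

def build_alignment_sequence(script_ids: List[int],
                             video_codes: List[int],
                             audio_codes: List[int]) -> List[int]:
    """Interleave text, video, and audio tokens into one LLM input sequence."""
    seq = list(script_ids)                                          # ordinary text tokens
    seq += [BOV] + [VIDEO_OFFSET + c for c in video_codes] + [EOV]  # video segment
    seq += [BOA] + [AUDIO_OFFSET + c for c in audio_codes] + [EOA]  # audio segment
    return seq
```

Under this layout the LLM backbone only needs its embedding table extended to cover the shifted code ranges, which is consistent with the caption's note that Stage 1 aligns the added token embeddings with the backbone.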