Text-to-Edit: Controllable End-to-End Video Ad Creation via Multimodal LLMs

Dabing Cheng, Haosen Zhan, Xingchen Zhao, Guisheng Liu, Zemin Li, Jinghui Xie, Zhao Song, Weiguo Feng, Bingyue Peng

TL;DR

This work tackles end-to-end controllable video ad creation by marrying Multimodal Large Language Models with a dense, dual-path video encoder. It defines a three-track, JSON-based editing draft output and uses a denser frame-rate plus a slow-fast processing strategy to capture both temporal dynamics and spatial details. A free-prompt data pipeline enables user-driven customization across duration, storyline, audience, and aesthetics, achieving strong free-prompt adherence and script quality. The approach is validated on a 100K VideoAds dataset, demonstrates transferability to the Shot2story public dataset, and shows competitive or superior performance across multiple quantitative metrics and human evaluations, with practical implications for rapid, controllable video ad production.

Abstract

The exponential growth of short-video content has created a pressing need for efficient, automated video editing; the main challenges are understanding the videos and tailoring the edits to user requirements. To address this need, we propose an end-to-end foundational framework that gives precise control over the final edited video. Leveraging the flexibility and generalizability of Multimodal Large Language Models (MLLMs), we define clear input-output mappings for efficient video creation. To strengthen the model's ability to process and comprehend video content, we combine a denser frame rate with a slow-fast processing technique, which significantly improves the extraction and understanding of both temporal and spatial video information. Furthermore, we introduce a text-to-edit mechanism that lets users obtain the desired video through textual input, improving the quality and controllability of the edited videos. Comprehensive experiments show that our method is effective on advertising datasets and that its conclusions carry over to public datasets.

Paper Structure (26 sections, 7 equations, 11 figures, 7 tables)

Figures (11)

  • Figure 1: Examples of videos generated by our method. Leveraging a multimodal LLM, we propose an end-to-end solution for creating advertising narrative videos. The approach takes product information, expected requirements (the free prompt), and video clips as input and directly outputs a draft protocol for video edits. It enables precise, controllable video editing that aligns with user-defined free prompts. In the figure, matching colors indicate aligned information. See more cases in the Appendix.
  • Figure 2: Illustration of our model architecture. We input product information, free prompt, and video clips into the model, generating a JSON-formatted video editing draft that is transformed into a video through post-processing and rendering. We implement a slow-fast dual-pathway strategy for frame rates: the fast pathway has a higher frame rate with fewer tokens per frame (yellow star), and the slower pathway has a lower frame rate with more tokens per frame (green star). By integrating the free prompt into instructions, we enhance the flexibility and control of the video editing process.
  • Figure 3: Pipeline for generating free prompts. We produce high-quality free prompts through a four-step process: deconstruction, analysis, generation, and verification. All experiments in this paper use the "gpt-4o-2024-08-06" version.
  • Figure 4: Depiction of our VideoAds dataset distribution. (a) illustrates the distribution of the number of clips per data sample, encompassing both positive and negative clips, with a mean $\mu_{\mathrm{numClips}}=5.88$. (b) depicts the distribution of clip durations, with a mean $\mu_{\mathrm{clipDuration}}=8.03$ seconds.
  • Figure 5: Analyzing the impact of token numbers on CRA and CSA metrics (fps=0.125).
  • ...and 6 more figures
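
The slow-fast trade-off described in the Figure 2 caption (a fast pathway with more frames but fewer tokens per frame, a slow pathway with fewer frames but more tokens per frame) can be illustrated with a small token-budget sketch. This is not the authors' implementation: the `Pathway` class, the helper functions, and every numeric setting below (frame rates, tokens per frame, the 8-second clip) are hypothetical values chosen only to show how the two pathways can spend a similar token budget in different ways.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pathway:
    """One sampling pathway (hypothetical structure, not from the paper)."""
    fps: float             # frames sampled per second of video
    tokens_per_frame: int  # visual tokens kept for each sampled frame


def sample_times(duration_s: float, fps: float) -> list[float]:
    """Return uniformly spaced sampling timestamps in [0, duration_s)."""
    n_frames = max(1, int(duration_s * fps))  # always sample at least one frame
    step = duration_s / n_frames
    return [i * step for i in range(n_frames)]


def token_budget(duration_s: float, pathway: Pathway) -> int:
    """Total visual tokens one pathway produces for a clip."""
    return len(sample_times(duration_s, pathway.fps)) * pathway.tokens_per_frame


# Assumed settings: the fast pathway favors temporal coverage,
# the slow pathway favors spatial detail per frame.
FAST = Pathway(fps=2.0, tokens_per_frame=16)
SLOW = Pathway(fps=0.5, tokens_per_frame=64)

clip_duration_s = 8.0  # illustrative clip length
fast_tokens = token_budget(clip_duration_s, FAST)
slow_tokens = token_budget(clip_duration_s, SLOW)
total_tokens = fast_tokens + slow_tokens
```

With these assumed settings the fast pathway samples 16 frames and the slow pathway samples 4, yet each contributes 256 tokens: the fast pathway spends its budget on temporal coverage, the slow pathway on per-frame spatial detail.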