VC-LLM: Automated Advertisement Video Creation from Raw Footage using Multi-modal LLMs

Dongjun Qian, Kai Su, Yiming Tan, Qishuai Diao, Xian Wu, Chang Liu, Bingyue Peng, Zehuan Yuan

TL;DR

VC-LLM automates short-form advertisement video creation from raw footage by combining a dual-resolution clip representation with multi-modal LLMs and ground-truth script augmentation to reduce hallucinations. The framework outputs sequences of (clip, script, subtitles) conditioned on product information, and is trained with both continued pre-training and supervised fine-tuning on a large, curated dataset. A dedicated benchmark evaluates narrative logic, alignment, factuality, and subtitle quality, showing GPT-4o-based VC-LLM can match or exceed human performance, with fine-tuning yielding clearer narrative structure and better script quality. The results demonstrate practical potential to reduce production costs and enable scalable, personalized advertising.

Abstract

As short videos have risen in popularity, the role of video content in advertising has become increasingly significant. Typically, advertisers record a large amount of raw footage about the product and then create numerous different short-form advertisement videos based on this raw footage. Creating such videos mainly involves editing raw footage and writing advertisement scripts, which requires a certain level of creative ability. It is usually challenging to create many different videos for the same product, and manual production is often inefficient. In this paper, we present VC-LLM, a framework powered by Large Language Models for the automatic creation of high-quality short-form advertisement videos. Our approach leverages high-resolution spatial input and low-resolution temporal input to represent video clips more effectively, capturing both fine-grained visual details and broader temporal dynamics. In addition, during training, we incorporate supplementary information generated by rewriting the ground truth text, ensuring that all key output information can be directly traced back to the input, thereby reducing model hallucinations. We also designed a benchmark to evaluate the quality of the created videos. Experiments show that VC-LLM based on GPT-4o can produce videos comparable to those created by humans. Furthermore, we collected numerous high-quality short advertisement videos to create a pre-training dataset and manually cleaned a portion of the data to construct a high-quality fine-tuning dataset. Experiments indicate that, on the benchmark, VC-LLM based on a fine-tuned LLM can produce videos with superior narrative logic compared to those created by VC-LLM based on GPT-4o.

Paper Structure

This paper contains 20 sections, 12 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Overview of the VC-LLM Framework: Input product information and video clips, output a sequence of selected video clips, corresponding scripts, and segmented subtitles.
  • Figure 2: The architecture of the model. The model accepts product information and an indexed series of video clips as input, and it generates an output sequence in which each element consists of a selected video clip, its associated script and subtitle segmentation. During training, the parameters of the language model and the text embedding module are updated, whereas the visual encoder’s parameters remain fixed. Text content is processed via the text embedding module to yield an embedding sequence derived from tokenization, while visual content is transformed by the visual encoder into its corresponding embedding sequence. Green blocks denote tokens associated with the spatial input, and yellow blocks denote tokens corresponding to the temporal input.
  • Figure 3: The spatial and temporal representation of a video clip. The spatial input is a higher resolution image of the middle frame, providing detailed information about the product and environment. The temporal input is a sequence of images covering the entire video clip and providing motion information.
  • Figure 4: Clip representation with extra information. During SFT, we use the rewritten script as additional input information to ensure that the critical information required for predicting the ground truth script can be found in the input, thereby reducing model hallucination.
  • Figure 5: Statistics of the dataset: (a) Video Duration - The distribution of overall video lengths. (b) Word Count (Video-level) - The distribution of word counts per video. (c) Number of Clips per Video - The distribution of the number of separate clips taken from each video. (d) Clip Duration - The distribution of video clip lengths. (e) Word Count (Clip-level) - The distribution of word counts per video clip. (f) Merchandise Industry - The distribution of merchandise industries.
  • ...and 2 more figures
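
The dual-resolution clip representation described in Figure 3 can be illustrated with a minimal sketch of frame selection: one middle frame for the spatial input and a sequence of frames covering the whole clip for the temporal input. This is not the authors' implementation. The names `ClipSampling` and `sample_clip_frames`, the default of 8 temporal frames, and the even spacing of the temporal frames are all assumptions made for illustration; the captions above only state that the spatial input is the middle frame at higher resolution and that the temporal input covers the entire clip.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClipSampling:
    """Frame indices chosen to represent one video clip (hypothetical structure)."""
    spatial_index: int           # middle frame, intended to be decoded at higher resolution
    temporal_indices: List[int]  # frames spread over the clip, intended for lower resolution


def sample_clip_frames(num_frames: int, num_temporal: int = 8) -> ClipSampling:
    """Pick the middle frame plus evenly spaced frames covering the clip.

    Assumption: temporal frames are taken at the centers of equal-length
    segments. The source does not specify the sampling scheme or frame count.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if num_temporal <= 0:
        raise ValueError("num_temporal must be positive")

    # Spatial input: the middle frame of the clip.
    spatial_index = num_frames // 2

    # Temporal input: never request more frames than the clip contains.
    k = min(num_temporal, num_frames)
    temporal_indices = [int((i + 0.5) * num_frames / k) for i in range(k)]

    return ClipSampling(spatial_index=spatial_index, temporal_indices=temporal_indices)


if __name__ == "__main__":
    sampling = sample_clip_frames(num_frames=100, num_temporal=4)
    print(sampling.spatial_index)     # 50
    print(sampling.temporal_indices)  # [12, 37, 62, 87]
```

In a full pipeline, the frame at `spatial_index` would be resized to a larger resolution and the frames at `temporal_indices` to a smaller one before being passed to the visual encoder; the actual resolutions are not given in this section, so they are left out of the sketch.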