AdSum: Two-stream Audio-visual Summarization for Automated Video Advertisement Clipping

Wen Xie, Yanjun Zhu, Gijs Overgoor, Yakov Bart, Agata Lapedriza Garcia, Sarah Ostadabbas

TL;DR

The paper recasts automated ad clipping as a shot selection problem: deriving a 15-second ad from a 30-second original by choosing which shots to keep. It introduces AdSum204, a dataset of 102 pairs of 30-second and 15-second ads (204 videos) with shot-level mappings between each pair, and presents AdSum, a two-stream audio-visual model that fuses synchronized visual (Swin3D) and audio (Wav2Vec2) features. Early fusion of the two modalities yields the best performance across the reported metrics (AP, AUROC, Spearman, Kendall), which underscores the role of audio in advertising content. The work targets scalable, cost-efficient generation of short-form video ads and releases code and data for future research in ad-specific video summarization.
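As a rough illustration of what early fusion of synchronized streams can mean, the sketch below aligns two feature sequences covering the same time span and concatenates them per time step before any prediction head. The function name `early_fuse`, the proportional index alignment, and the plain-list representation are assumptions made for illustration; the summary does not describe the model's actual alignment or fusion layers.

```python
def early_fuse(visual, audio):
    """Concatenate visual and audio feature vectors per visual time step.

    visual: list of T_v feature vectors (lists of floats)
    audio:  list of T_a feature vectors covering the same time span
    Assumption: the two streams are aligned by relative position in time.
    """
    if not visual or not audio:
        raise ValueError("both streams must be non-empty")
    fused = []
    for t, v in enumerate(visual):
        # Map visual step t to the audio step at the same relative time.
        a_idx = min(len(audio) - 1, t * len(audio) // len(visual))
        fused.append(list(v) + list(audio[a_idx]))
    return fused


if __name__ == "__main__":
    visual = [[1.0], [2.0]]                 # 2 visual steps
    audio = [[10.0], [11.0], [12.0], [13.0]]  # 4 audio steps, same duration
    print(early_fuse(visual, audio))
```

A late-fusion variant would instead score each stream separately and combine the scores afterwards.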

Abstract

Advertisers commonly need multiple versions of the same advertisement (ad) at varying durations for a single campaign. The traditional approach involves manually selecting and re-editing shots from longer video ads to create shorter versions, which is labor-intensive and time-consuming. In this paper, we introduce a framework for automated video ad clipping using video summarization techniques. We are the first to frame video clipping as a shot selection problem, tailored specifically for advertising. Unlike existing general video summarization methods that primarily focus on visual content, our approach emphasizes the critical role of audio in advertising. To achieve this, we develop a two-stream audio-visual fusion model that predicts the importance of video frames, where importance is defined as the likelihood of a frame being selected in the firm-produced short ad. To address the lack of ad-specific datasets, we present AdSum204, a novel dataset comprising 102 pairs of 30-second and 15-second ads from real advertising campaigns. Extensive experiments demonstrate that our model outperforms state-of-the-art methods across various metrics, including Average Precision, Area Under Curve, Spearman, and Kendall. The dataset and code are available at https://github.com/ostadabbas/AdSum204.
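The importance definition above can be read as a binary frame label: a frame of the 30-second ad is positive if its shot also appears in the 15-second ad. The sketch below builds such labels from shot boundaries and scores a set of predicted importances with Average Precision. The helper names, the half-open `(start, end)` frame ranges, and the non-interpolated AP formula are assumptions for illustration, not the authors' evaluation code.

```python
def frame_labels(shot_bounds, selected_shots):
    """Return one 0/1 label per frame of the long ad.

    shot_bounds:    list of (start_frame, end_frame) pairs, end exclusive
    selected_shots: set of shot indices that also appear in the short ad
    """
    labels = []
    for idx, (start, end) in enumerate(shot_bounds):
        labels.extend([1 if idx in selected_shots else 0] * (end - start))
    return labels


def average_precision(labels, scores):
    """Non-interpolated AP: mean precision at the rank of each positive frame."""
    # Rank frames by descending score; the sort is stable, so ties keep frame order.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precision_sum = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


if __name__ == "__main__":
    bounds = [(0, 2), (2, 4), (4, 6)]   # three shots of two frames each
    labels = frame_labels(bounds, {0, 2})
    scores = [0.9, 0.8, 0.7, 0.1, 0.6, 0.2]
    print(labels, average_precision(labels, scores))
```

Library implementations may treat tied scores differently, so values can differ slightly from this sketch when many frames share a score.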

Paper Structure

This paper contains 15 sections, 4 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Video ads with various durations in the same campaign. The screenshots include two examples of long (i.e., 31-second) and short (i.e., 16-second) video ads from brands' YouTube channels. Red circles (orange squares) highlight the ad durations (titles).
  • Figure 2: Shot selection. The figure illustrates a pair of 30-second and 15-second ads from a McDonald's ad campaign. The 30-second ad (left) contains 17 shots. The 15-second ad (right) contains 9 shots from the 30-second ad, as indicated by the matching (e.g., shot 1 - shot 1). We show the first and last frames of each shot for better presentation.
  • Figure 3: An overview of videos in our dataset. We sample frames from 36 (3x12) videos.
  • Figure 4: Shot count and duration histogram. 30-second (15-second) ads contain 18 (10) shots on average; the average shot duration is 1.67 (1.53) seconds, respectively.
  • Figure 5: Ad clipping pipeline. Our methodology includes three steps: (1) shot generation, (2) frame importance prediction, and (3) shot selection to make a short video ad.
  • ...and 2 more figures
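The third step of the pipeline in Figure 5, turning frame importances into a short ad, can be sketched as follows. Each shot is scored by the mean importance of its frames, and shots are then picked under a duration budget. The greedy highest-score-first rule, the function name `select_shots`, and the 15-second default are assumptions for illustration; the caption does not say which selection procedure the paper uses.

```python
def select_shots(shot_bounds, frame_scores, fps, budget_seconds=15.0):
    """Pick shots for the short ad and return their indices in playback order.

    shot_bounds:  list of (start_frame, end_frame) pairs, end exclusive
    frame_scores: predicted importance per frame of the long ad
    fps:          frames per second of the long ad
    """
    shots = []
    for idx, (start, end) in enumerate(shot_bounds):
        mean_score = sum(frame_scores[start:end]) / (end - start)
        shots.append((mean_score, idx, (end - start) / fps))

    chosen, used = [], 0.0
    # Greedy rule (an assumption): take the highest-scoring shots that still fit.
    for _, idx, duration in sorted(shots, key=lambda s: -s[0]):
        if used + duration <= budget_seconds:
            chosen.append(idx)
            used += duration
    return sorted(chosen)  # restore chronological order for the final cut


if __name__ == "__main__":
    bounds = [(0, 4), (4, 6), (6, 10), (10, 12)]
    scores = [0.9] * 4 + [0.2] * 2 + [0.8] * 4 + [0.5] * 2
    print(select_shots(bounds, scores, fps=2, budget_seconds=3.0))
```

A 0/1 knapsack solver is a common alternative when the goal is to maximize total score under the same budget rather than to pick shots greedily.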