Towards Automated Movie Trailer Generation

Dawit Mureja Argaw; Mattia Soldan; Alejandro Pardo; Chen Zhao; Fabian Caba Heilbron; Joon Son Chung; Bernard Ghanem

TL;DR

This work introduces the Trailer Generation Transformer (TGT), an autoregressive encoder-decoder framework that translates full movies into plausible trailers by modeling both as shot sequences, using a trailerness-aware encoder and a context-aware Transformer decoder. Trained on paired movie-trailer data with reconstruction, trailerness, and KL-divergence losses, TGT learns both which shots to include and how to order them, producing non-chronological, narrative-driven trailers. The authors construct two automatic trailer generation (ATG) benchmarks on MAD and MovieNet and show that TGT outperforms prior trailer-generation and video-summarization methods on multiple metrics, including F1, LD, and SLD. They also demonstrate the benefits of text-conditioned generation and provide a shot-selection analysis. The work highlights practical implications for automating initial trailer assembly while preserving editorial flexibility, and proposes future extensions that model dialogue and audio for more realistic trailers.
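As a rough illustration of the shot-selection side of the evaluation, the sketch below computes a set-based F1 between the shots chosen for a generated trailer and those in the reference trailer. The function name `shot_f1`, the use of integer shot indices, and exact-index matching are assumptions made for this example; the summary does not specify the paper's matching protocol.

```python
def shot_f1(predicted, reference):
    """Set-based F1 between two collections of movie shot indices.

    Assumption: a predicted shot counts as correct only if its index
    appears in the reference trailer (exact match, order ignored).
    """
    pred, ref = set(predicted), set(reference)
    if not pred or not ref:
        return 0.0
    true_pos = len(pred & ref)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical shot indices, chosen only for illustration.
generated = [12, 3, 47, 8]
ground_truth = [3, 47, 90, 12, 5]
score = shot_f1(generated, ground_truth)
print(round(score, 3))
```

A set-based score like this ignores shot order, so on its own it cannot tell a well-sequenced trailer from a shuffled one with the same shots.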

Abstract

Movie trailers are an essential tool for promoting films and attracting audiences. However, the process of creating trailers can be time-consuming and expensive. To streamline this process, we propose an automatic trailer generation framework that generates plausible trailers from a full movie by automating shot selection and composition. Our approach draws inspiration from machine translation techniques and models the movies and trailers as sequences of shots, thus formulating the trailer generation problem as a sequence-to-sequence task. We introduce Trailer Generation Transformer (TGT), a deep-learning framework utilizing an encoder-decoder architecture. The TGT movie encoder is tasked with contextualizing each movie shot representation via self-attention, while the autoregressive trailer decoder predicts the feature representation of the next trailer shot, accounting for the relevance of shots' temporal order in trailers. TGT significantly outperforms previous methods on a comprehensive suite of metrics.
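The next-shot prediction setup can be sketched with plain Python lists. The example below shows how a ground-truth trailer could be shifted by one step to form decoder inputs and targets, together with a causal mask that keeps each position from seeing later shots. The `<start>` token and both helper names are hypothetical and not taken from the paper; real inputs would be shot feature vectors rather than strings.

```python
def next_shot_pairs(trailer_shots, start_token):
    """Shift the trailer by one step for next-shot prediction.

    The decoder input at step t is the shot at step t-1 (or the start
    token at t=0); the target at step t is the shot at step t.
    """
    inputs = [start_token] + list(trailer_shots[:-1])
    targets = list(trailer_shots)
    return inputs, targets


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


# Placeholder shot identifiers in non-chronological order.
trailer = ["shot_41", "shot_7", "shot_112"]
inputs, targets = next_shot_pairs(trailer, start_token="<start>")
mask = causal_mask(len(inputs))

for step, (src, tgt) in enumerate(zip(inputs, targets)):
    print(step, src, "->", tgt, mask[step])
```

With the mask in place, all positions can be trained in one pass while each prediction still depends only on earlier trailer shots.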

Paper Structure (30 sections, 8 equations, 3 figures, 6 tables)

This paper contains 30 sections, 8 equations, 3 figures, 6 tables.

Introduction
Related Works
Trailer Generation
Video Summarization
Methodology
Problem Formulation
Proposed Trailer Generation Transformer
Movie Encoder
Trailerness Encoder
Context Encoder
Trailer Decoder
Training Losses
Datasets
Evaluation Metrics
Experiment
...and 15 more sections

Figures (3)

Figure 1: Trailer Generation Problem and Solutions. The top row depicts the movie and expert-created trailer. The process composes shots from the movie in non-chronological order to create a compelling and intriguing story. The central row depicts classification/ranking strategies, which score each shot in the movie independently (classification) or with limited relative interaction (ranking). The bottom row represents our approach, which can reason over the entire input movie sequence before producing a provisional trailer with non-chronological shot order.
Figure 2: Architecture Overview. Subfigure (a) illustrates the training pipeline of our TGT model. Movies are segmented into shots and transformed into visual embeddings via a pre-trained CLIP model (Radford et al., 2021). Enhanced with positional embeddings and trailerness scores, these tokens undergo context encoding. During training, the trailer decoder uses ground-truth trailer shots as queries for cross-attention with the encoder output and regresses the next-shot features for all positions in parallel under a causal mask. Subfigure (b) shows the inference pipeline, where the trailer decoder generates trailer shots sequentially in an autoregressive manner while the movie encoder process remains unchanged.
Figure 3: Subjective quality of our trailer generation. We compare a movie trailer against the one produced by our TGT method. We highlight in green the correctly selected shots, in orange shots that are visually similar, and in red mismatched shots.
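The inference pipeline in Figure 2(b) can be approximated by a greedy loop. In the sketch below, a predictor proposes the feature of the next trailer shot and the closest unused movie shot is retrieved by cosine similarity. The retrieval step, the no-repeat rule, the fixed trailer length, and the `predict_next` callable (a toy stand-in for the trained decoder) are all assumptions made for this example, not details confirmed by the captions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def generate_trailer(movie_shots, predict_next, max_len):
    """Greedy autoregressive shot selection (illustrative only).

    predict_next(movie_shots, history) is a hypothetical stand-in for the
    decoder: it returns the predicted feature of the next trailer shot.
    """
    chosen, history = [], []
    for _ in range(max_len):
        predicted = predict_next(movie_shots, history)
        # Assumption: each movie shot is used at most once.
        candidates = [i for i in range(len(movie_shots)) if i not in chosen]
        if not candidates:
            break
        best = max(candidates, key=lambda i: cosine(predicted, movie_shots[i]))
        chosen.append(best)
        history.append(movie_shots[best])
    return chosen


# Toy 2-D "shot features" and a scripted predictor, for illustration only.
movie = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scripted = [[0.1, 0.9], [0.9, 0.8], [1.0, 0.1]]


def toy_predict(movie_shots, history):
    return scripted[len(history)]


order = generate_trailer(movie, toy_predict, max_len=3)
print(order)
```

Because each step conditions on the shots selected so far, the resulting order need not follow the movie's chronology.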
