An Inverse Partial Optimal Transport Framework for Music-guided Movie Trailer Generation

Yutong Wang, Sidan Zhu, Hongteng Xu, Dixin Luo

TL;DR

This paper frames music-guided movie trailer generation as a cross-modal, transport-based matching problem. It introduces an inverse partial optimal transport (IPOT) framework that jointly learns a multi-modal latent representation, a movie-shot selector, and a movie-music aligner via bi-level optimization, using Sinkhorn-based grounding of visual and audio shots. The approach is evaluated on a new Comprehensive Movie-Trailer Dataset (CMTD), where it outperforms state-of-the-art baselines on both objective metrics (shot selection accuracy and OT alignment) and subjective user studies. The work offers a principled, data-driven path toward automated trailer generation with semantic alignment between visuals and audio, and contributes a dataset to support further research in video understanding and cross-modal generation.

Abstract

Trailer generation is a challenging video clipping task that aims to select highlight shots from long videos such as movies and re-organize them in an attractive way. In this study, we propose an inverse partial optimal transport (IPOT) framework to achieve music-guided movie trailer generation. In particular, we formulate the trailer generation task as selecting and sorting key movie shots based on audio shots, which involves matching the latent representations across visual and acoustic modalities. We learn a multi-modal latent representation model in the proposed IPOT framework to achieve this aim. In this framework, a two-tower encoder derives the latent representations of movie and music shots, respectively, and an attention-assisted Sinkhorn matching network parameterizes the grounding distance between the shots' latent representations and the distribution of the movie shots. Taking the correspondence between a movie's shots and its trailer's music shots as the observed optimal transport plan defined on the grounding distances, we learn the model by solving an inverse partial optimal transport problem, leading to a bi-level optimization strategy. We collect real-world movies and their trailers to construct CMTD, a dataset with abundant label information, and use it to train and evaluate various automatic trailer generators. Compared with state-of-the-art methods, our IPOT method consistently shows superiority in subjective visual effects and objective quantitative measurements.
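To make the matching step in the abstract more concrete, the following is a minimal sketch of the forward problem only: an entropic partial optimal transport plan between movie-shot and music-shot embeddings, computed with Sinkhorn iterations. It is not the authors' implementation. The function names `sinkhorn` and `partial_match`, the squared Euclidean cost (standing in for the learned, attention-assisted grounding distance), the unit capacity per movie shot, and the zero-cost dummy column that absorbs unselected movie shots are all assumptions made for illustration. The inverse (bi-level) learning step is not shown.

```python
import numpy as np


def sinkhorn(cost, a, b, eps=0.05, n_iters=500):
    """Entropic OT between marginals a and b (equal total mass) via Sinkhorn scaling."""
    kernel = np.exp(-cost / eps)       # Gibbs kernel of the cost matrix
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):           # fixed iteration budget, no convergence check
        u = a / (kernel @ v)           # rescale rows toward marginal a
        v = b / (kernel.T @ u)         # rescale columns to match marginal b
    return u[:, None] * kernel * v[None, :]


def partial_match(movie_emb, music_emb, eps=0.05, n_iters=500):
    """Hypothetical helper: partial transport plan of shape (n_movie, n_music).

    Every music shot receives unit mass, and every movie shot can give at
    most (approximately) unit mass, so only a subset of movie shots is used.
    """
    n, m = movie_emb.shape[0], music_emb.shape[0]
    if m > n:
        raise ValueError("expected at least as many movie shots as music shots")

    # Assumed cost: squared Euclidean distance, scaled to [0, 1].
    diff = movie_emb[:, None, :] - music_emb[None, :, :]
    cost = (diff ** 2).sum(axis=-1)
    cost = cost / max(cost.max(), 1e-12)

    # A zero-cost dummy column absorbs the mass of unselected movie shots,
    # which turns the partial problem into a balanced one.
    cost_aug = np.concatenate([cost, np.zeros((n, 1))], axis=1)
    a = np.ones(n)
    b = np.concatenate([np.ones(m), [float(n - m)]])

    plan_aug = sinkhorn(cost_aug, a, b, eps=eps, n_iters=n_iters)
    return plan_aug[:, :m]             # drop the dummy column


if __name__ == "__main__":
    movie = np.eye(6)                  # six toy movie-shot embeddings
    music = movie[[4, 1, 3]]           # three toy music-shot embeddings
    plan = partial_match(movie, music)
    print(plan.argmax(axis=0))         # movie shot chosen for each music shot
```

Reading the plan column by column gives both the selection (which movie shots carry mass) and the ordering (which music shot each one is aligned to). In an inverse formulation such as the one the abstract describes, a plan of this kind would be compared against the observed movie-to-trailer correspondence in order to update the cost model.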

Paper Structure

This paper contains 36 sections, 8 equations, 9 figures, 8 tables, 1 algorithm.

Figures (9)

  • Figure 1: An illustration of our IPOT framework for learning a music-guided movie trailer generator.
  • Figure 2: Visualization of the publication year distribution and the category proportions of movies in CMTD.
  • Figure 3: An illustration of the annotations associated with the movie "The Great Gatsby" in CMTD.
  • Figure 4: Comparison between generated trailer shots and the official trailer shots of the movie "Elysium", arranged by order of appearance. For each generated trailer, its correctly selected shots are marked with green boxes. A yellow arrow connecting a selected shot in our trailer to a shot in the official trailer indicates that the two shots belong to the same scene.
  • Figure 5: The violin plot of scores for various methods in user studies. The red crosses are means and the black bars are medians.
  • ...and 4 more figures