MVBIND: Self-Supervised Music Recommendation For Videos Via Embedding Space Binding

Jiajie Teng, Huiyu Duan, Yucheng Zhu, Sijing Wu, Guangtao Zhai

TL;DR

The paper tackles automatic background music recommendation for short videos by introducing MVBind, a self-supervised Music-Video embedding space binding model for cross-modal retrieval. It uses ImageBind-based multimodal features and a contrastive learning objective to align audio and visual embeddings without manual labels, and it provides the SVM-10K dataset to address the lack of suitable data for short videos. Key contributions are a high-quality dataset of short videos with music, the MVBind architecture with self-supervised cross-modal binding, and experiments showing improved Recall@K over baselines; the dataset and code will be released to support future work. The work targets emotion- and aesthetics-aware music alignment for short videos and offers a resource for cross-modal music-video research on streaming platforms.

Abstract

Recent years have witnessed the rapid development of short videos, which usually contain both visual and audio modalities. Background music is important to short videos and can significantly influence viewers' emotions. At present, however, the background music of a short video is generally chosen by the video producer, and automatic music recommendation methods for short videos are lacking. This paper introduces MVBind, an innovative Music-Video embedding space Binding model for cross-modal retrieval. MVBind is a self-supervised approach that learns intermodal relationships directly from data, without the need for manual annotations. Additionally, to compensate for the lack of a paired music-video dataset for short videos, we construct a dataset, SVM-10K (Short Video with Music-10K), which mainly consists of meticulously selected short videos. On this dataset, MVBind shows significantly improved performance compared to baseline methods. The constructed dataset and code will be released to facilitate future research.
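
To make the retrieval setting concrete, here is a minimal sketch of how Recall@K (the metric named in the TL;DR) could be computed for video-to-music retrieval. It assumes row i of each embedding matrix belongs to the same short video and that ranking uses cosine similarity; the function names and toy data are illustrative, not the authors' evaluation code.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def recall_at_k(video_emb, music_emb, k):
    """Fraction of videos whose paired track (same row index) ranks in the top k.

    Assumption: row i of video_emb and row i of music_emb come from the same clip.
    Ties are resolved in favor of the true track.
    """
    sims = l2_normalize(video_emb) @ l2_normalize(music_emb).T  # (N, N)
    true_scores = np.diag(sims)[:, None]
    # Rank of the true track = number of tracks scoring strictly higher.
    ranks = (sims > true_scores).sum(axis=1)
    return float((ranks < k).mean())

# Toy data: four pairs, with the music rows for clips 2 and 3 swapped,
# so those two videos retrieve the wrong track first.
video = np.eye(4)
music = np.eye(4)[[0, 1, 3, 2]]

r1 = recall_at_k(video, music, k=1)
r2 = recall_at_k(video, music, k=2)
print(r1, r2)
```

In this toy case two of the four videos retrieve their own track first, and the other two find it in second place.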

Paper Structure

This paper contains 12 sections, 3 equations, 1 figure, and 4 tables.

Figures (1)

  • Figure 1: An overview of the proposed MVBind. It first separates the audio and video modalities of each short video in the SVM-10K dataset. For the audio signal, the Mel spectrogram is extracted, and a 1024-dimensional audio feature is then obtained using a ViT pre-trained by ImageBind. For the video signal, preprocessing (such as removing black borders) is performed, and a 1024-dimensional video feature is then extracted using a ViT pre-trained by ImageBind. Self-supervised learning is then used to train and bind the two modal features. The ultimate goal is cross-modal music-video retrieval.
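
The binding step in Figure 1 can be illustrated with a symmetric InfoNCE-style contrastive loss over paired 1024-dimensional features. This is a sketch under assumptions: the paper's own equations are not reproduced in this excerpt, so the exact loss form, the temperature value of 0.07, and the random stand-in features are all hypothetical choices made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def symmetric_info_nce(video_feat, audio_feat, temperature=0.07):
    """Contrastive loss treating (video_i, audio_i) as the only positive pair.

    Assumption: this symmetric form and the temperature are illustrative;
    they are not taken from the paper.
    """
    v = l2_normalize(video_feat)
    a = l2_normalize(audio_feat)
    logits = v @ a.T / temperature           # (N, N) scaled cosine similarities
    idx = np.arange(len(v))
    loss_v2a = -log_softmax(logits)[idx, idx].mean()    # video -> audio
    loss_a2v = -log_softmax(logits.T)[idx, idx].mean()  # audio -> video
    return 0.5 * (loss_v2a + loss_a2v)

# Random stand-ins for the 1024-dimensional features from Figure 1.
rng = np.random.default_rng(0)
video_feat = rng.normal(size=(8, 1024))
near_copy = video_feat + 0.01 * rng.normal(size=(8, 1024))
unrelated = rng.normal(size=(8, 1024))

aligned = symmetric_info_nce(video_feat, near_copy)
shuffled = symmetric_info_nce(video_feat, unrelated)
print(aligned, shuffled)
```

The loss is low when each video feature is closest to its own audio feature and high when the pairing carries no signal, which is the behavior a training loop would exploit when updating whatever layers are learned on top of the extracted features.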