Semi-Supervised Contrastive Learning for Controllable Video-to-Music Retrieval

Shanti Stewart, Gouthaman KV, Lie Lu, Andrea Fanelli

TL;DR

Control-MVR tackles cross-modal music-video retrieval by learning a joint embedding space with a semi-supervised contrastive framework that blends self-supervised audiovisual alignment with supervised genre signals. The architecture uses two frozen backbones (MERT for audio, CLIP for video), task-specific heads, and a controllable embedding $z^M = (1-\alpha)\, p_{ssl}^M(q_{ssl}^M) + \alpha\, p_{sup}^M(q_{sup}^M)$, where $M$ stands for the modality (the music and video embeddings appear as $z^A$ and $z^V$ in Figure 1) and $\alpha$ sets the balance between broad audiovisual patterns and genre-specific knowledge. On a genre-annotated dataset derived from AudioSet, the model reaches state-of-the-art or near-state-of-the-art performance on both self-supervised and genre-supervised retrieval, and tuning $\alpha$ at inference shifts performance between the two. The results suggest that pairing semi-supervised learning with controllable embeddings is a practical route to flexible music-video retrieval, with possible extensions to additional labels and language guidance.
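The blending rule in the TL;DR can be illustrated with a minimal NumPy sketch. It assumes the outputs of the two heads, $p_{ssl}^M(q_{ssl}^M)$ and $p_{sup}^M(q_{sup}^M)$, have already been computed and are passed in as plain arrays. The function names, the restriction of $\alpha$ to $[0, 1]$, the L2 normalization, and the cosine-similarity ranking are illustrative assumptions, not details confirmed by this page.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length (assumed here so dot products are cosine similarities)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def controllable_embedding(e_ssl, e_sup, alpha):
    """Blend the two head outputs: z = (1 - alpha) * e_ssl + alpha * e_sup.

    e_ssl, e_sup: precomputed outputs of the self-supervised and supervised heads
    (hypothetical inputs; the heads themselves are not modeled here).
    alpha: user-chosen weight, assumed to lie in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    z = (1.0 - alpha) * np.asarray(e_ssl) + alpha * np.asarray(e_sup)
    return l2_normalize(z)

def retrieve(query, gallery, k=5):
    """Return indices of the k gallery rows most similar to the query (dot-product ranking)."""
    scores = gallery @ query
    return np.argsort(-scores)[:k]
```

With `alpha=0.0` the result reduces to the normalized self-supervised output, and with `alpha=1.0` to the normalized supervised output; intermediate values interpolate between the two before ranking.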

Abstract

Content creators often use music to enhance their videos, from soundtracks in movies to background music in video blogs and social media content. However, identifying the best music for a video can be a difficult and time-consuming task. To address this challenge, we propose a novel framework for automatically retrieving a matching music clip for a given video, and vice versa. Our approach leverages annotated music labels, as well as the inherent artistic correspondence between visual and music elements. Distinct from previous cross-modal music retrieval works, our method combines both self-supervised and supervised training objectives. We use self-supervised and label-supervised contrastive learning to train a joint embedding space between music and video. We show the effectiveness of our approach by using music genre labels for the supervised training component, and our framework can be generalized to other music annotations (e.g., emotion, instrument, etc.). Furthermore, our method enables fine-grained control over how much the retrieval process focuses on self-supervised vs. label information at inference time. We evaluate the learned embeddings through a variety of video-to-music and music-to-video retrieval tasks. Our experiments show that the proposed approach successfully combines self-supervised and supervised objectives and is effective for controllable music-video retrieval.
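To make the two training signals named in the abstract concrete, here is a rough NumPy sketch of a cross-modal contrastive loss in two variants. The self-supervised variant is a generic symmetric InfoNCE loss in which the music and video from the same clip form the only positive pair; the label-supervised variant follows the general supervised-contrastive pattern in which every cross-modal pair sharing a label counts as a positive. Both are standard formulations used as stand-ins: the paper's exact losses, the temperature value, and the single-label-per-clip setup are assumptions.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def ssl_contrastive_loss(music, video, temperature=0.1):
    """Symmetric InfoNCE: row i of `music` and row i of `video` come from the same clip.

    music, video: (N, D) arrays, assumed L2-normalized.
    """
    logits = music @ video.T / temperature
    idx = np.arange(len(music))
    m2v = -log_softmax(logits, axis=1)[idx, idx].mean()  # music -> video direction
    v2m = -log_softmax(logits, axis=0)[idx, idx].mean()  # video -> music direction
    return 0.5 * (m2v + v2m)

def sup_contrastive_loss(music, video, labels, temperature=0.1):
    """Label-supervised variant: cross-modal pairs with equal labels are positives.

    labels: (N,) integer array, one (assumed) genre label per clip.
    """
    logits = music @ video.T / temperature
    pos = (labels[:, None] == labels[None, :]).astype(float)  # (N, N) positive mask
    # Average the log-probability over each anchor's positives; the diagonal
    # is always positive, so the denominators are never zero.
    m2v = -(pos * log_softmax(logits, axis=1)).sum(axis=1) / pos.sum(axis=1)
    v2m = -(pos * log_softmax(logits, axis=0)).sum(axis=0) / pos.sum(axis=0)
    return 0.5 * (m2v.mean() + v2m.mean())
```

When every clip has a unique label, the positive mask is the identity and the supervised loss coincides with the self-supervised one, which is one way to see the two objectives as points on the same spectrum.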

Paper Structure

This paper contains 12 sections, 6 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Overview of the semi-supervised Control-MVR framework. A dual-branch architecture processes music and video separately, using frozen pre-trained models as well as a series of trainable networks. Self-supervised and supervised cross-modal contrastive losses operate at different points in the model architecture. A user-defined weight parameter $\alpha$ provides explicit control of the output embeddings $z^A$ and $z^V$, which are used for music-video retrieval.
  • Figure 2: Control-MVR enables explicit control over the retrieval process at inference time. a) Decreasing $\alpha$ increases self-supervised content in the output embeddings, which in turn improves self-supervised retrieval performance. b) Conversely, increasing $\alpha$ increases genre-supervised content in the output embeddings, which improves genre-supervised retrieval performance.
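The kind of sweep summarized in Figure 2 could be set up along the following lines. This is a sketch under stated assumptions: Recall@K for paired retrieval and genre Precision@K are used as stand-in metrics (this page does not say which metrics the paper reports), all function names are hypothetical, and the embeddings are assumed to be precomputed head outputs for the same set of clips in both modalities.

```python
import numpy as np

def blend(e_ssl, e_sup, alpha, eps=1e-12):
    """z = (1 - alpha) * e_ssl + alpha * e_sup, then row-wise L2 normalization (assumed)."""
    z = (1.0 - alpha) * e_ssl + alpha * e_sup
    return z / np.maximum(np.linalg.norm(z, axis=1, keepdims=True), eps)

def top_k(queries, gallery, k):
    """Indices of the k highest-scoring gallery rows for each query row."""
    scores = queries @ gallery.T
    return np.argsort(-scores, axis=1)[:, :k]

def recall_at_k(queries, gallery, k):
    """Fraction of queries whose paired item (same row index) appears in the top k."""
    hits = top_k(queries, gallery, k) == np.arange(len(queries))[:, None]
    return hits.any(axis=1).mean()

def genre_precision_at_k(queries, gallery, q_labels, g_labels, k):
    """Fraction of top-k retrieved items that share the query's genre label."""
    retrieved = g_labels[top_k(queries, gallery, k)]
    return (retrieved == q_labels[:, None]).mean()

def sweep_alpha(video_ssl, video_sup, music_ssl, music_sup, labels, alphas, k=10):
    """Evaluate video-to-music retrieval for each alpha; returns (alpha, recall, precision) rows."""
    rows = []
    for a in alphas:
        zv = blend(video_ssl, video_sup, a)
        zm = blend(music_ssl, music_sup, a)
        rows.append((a,
                     recall_at_k(zv, zm, k),
                     genre_precision_at_k(zv, zm, labels, labels, k)))
    return rows
```

Calling `sweep_alpha(..., alphas=np.linspace(0, 1, 11))` on held-out embeddings would give one row per setting, which can then be plotted against $\alpha$ to inspect the trade-off the caption describes.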