Enhancing Video Music Recommendation with Transformer-Driven Audio-Visual Embeddings

Shimiao Liu, Alexander Lerch

TL;DR

This work tackles automated music recommendation for video by learning a shared cross-modal embedding that aligns audio and visual content without manual labels. It compares temporal encoders and finds that transformer-based encoding, paired with the InfoNCE loss in the TIVM model, is the most effective at capturing long-range audio-visual correlations. On a music-video subset of YouTube-8M, contrastive learning combined with transformer temporal modeling outperforms triplet-based baselines and LSTM variants, achieving higher recall across top-k metrics. The results point to a practical route toward scalable video soundtracking, with potential uses in creator tools and streaming platforms that suggest soundtracks automatically.

Abstract

A fitting soundtrack can help a video convey its content and provide a more immersive experience. This paper introduces a novel approach that uses self-supervised and contrastive learning to automatically recommend audio for video content, thereby eliminating the need for manual labeling. We use a dual-branch cross-modal embedding model that maps both audio and video features into a common low-dimensional space. The fit of an audio-video pair can then be modeled as an inverse distance measure: the closer the two embeddings, the better the match. In addition, we present a comparative analysis of temporal encoding methods, emphasizing the effectiveness of transformers in handling the temporal information in audio-video matching tasks. Through multiple experiments, we demonstrate that our model TIVM, which integrates transformer encoders and is trained with the InfoNCE loss, significantly improves audio-video matching performance and surpasses traditional methods.
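To illustrate how fit can be read off as an inverse distance in the shared space, here is a minimal sketch that ranks candidate audio embeddings for one video embedding. The abstract does not say which distance is used, so cosine distance is an assumption here, and the function names and toy three-dimensional vectors are hypothetical stand-ins for the outputs of the two branches.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine_distance(a, b):
    """Assumed distance measure: 1 - cosine similarity."""
    a, b = l2_normalize(a), l2_normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a, b))

def rank_audio_for_video(video_emb, audio_embs):
    """Return candidate indices ordered from best fit (smallest distance) to worst."""
    distances = [cosine_distance(video_emb, a) for a in audio_embs]
    return sorted(range(len(audio_embs)), key=lambda i: distances[i])

# Toy embeddings; in practice these would come from the video and audio branches.
video = [0.9, 0.1, 0.0]
candidates = [
    [0.0, 1.0, 0.0],  # nearly orthogonal to the video embedding
    [1.0, 0.0, 0.0],  # closest to the video embedding
    [0.5, 0.5, 0.0],  # in between
]
ranking = rank_audio_for_video(video, candidates)
print(ranking)
```

Recommendation then amounts to returning the first few entries of the ranking; a top-k recall metric checks whether the ground-truth audio track appears among them.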

Paper Structure

This paper contains 31 sections, 1 equation, 1 figure, 1 table.

Figures (1)

  • Figure 1: Model Architecture. We use VGGish and ResNet-50 to extract features from audio and video, respectively. The model consists of two separate pathways for processing audio and video, each composed of two or three linear layers. Each pathway also includes a temporal encoder, either a transformer or an LSTM, for comparison. The model is trained using the InfoNCE loss.
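
The InfoNCE objective named in the caption can be sketched in a few lines. This is a generic, one-directional (video-to-audio) formulation over a batch similarity matrix, not necessarily the exact variant used in the paper; the temperature value of 0.1 and the function name are assumptions made for illustration.

```python
import math

def info_nce(sim, temperature=0.1):
    """Mean InfoNCE loss for a square similarity matrix.

    sim[i][j] is the similarity between video i and audio j; the matched
    (positive) pair for row i is assumed to sit on the diagonal, and every
    other audio clip in the batch acts as a negative.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]  # -log softmax of the positive pair
    return total / n

aligned = [[1.0, 0.0], [0.0, 1.0]]     # positives score highest
misaligned = [[0.0, 1.0], [1.0, 0.0]]  # negatives score highest
print(info_nce(aligned) < info_nce(misaligned))
```

The loss falls as each matched pair becomes more similar than the in-batch negatives, which is what pulls corresponding audio and video embeddings together in the shared space.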