Enhancing Video Music Recommendation with Transformer-Driven Audio-Visual Embeddings
Shimiao Liu, Alexander Lerch
TL;DR
This work tackles automated music recommendation for video by learning a shared cross-modal embedding that aligns audio and visual content without manual labels. It compares temporal encoders and finds that transformer-based encoding paired with the InfoNCE loss (the TIVM model) is the most effective at capturing long-range audio-visual correlations. On a music-video subset of YouTube-8M, contrastive learning combined with transformer temporal modeling outperforms triplet-based baselines and LSTM variants, achieving higher recall across top-k metrics. The results point to a practical path toward scalable, high-quality video soundtracking, with potential impact on creator tools and streaming platforms through automated, artistically coherent soundtrack suggestions.
Abstract
A fitting soundtrack can help a video convey its content more effectively and provide a more immersive experience. This paper introduces a novel approach that uses self-supervised and contrastive learning to automatically recommend audio for video content, thereby eliminating the need for manual labeling. We use a dual-branch cross-modal embedding model that maps both audio and video features into a common low-dimensional space. The fit of an audio-video pair can then be modeled as an inverse distance measure. In addition, we present a comparative analysis of temporal encoding methods, emphasizing the effectiveness of transformers in handling the temporal information in audio-video matching tasks. Through multiple experiments, we demonstrate that our model TIVM, which integrates transformer encoders and uses the InfoNCE loss, significantly improves audio-video matching performance and surpasses traditional methods.
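
To make the matching objective concrete, the following is a minimal sketch of an InfoNCE-style loss and a recall-at-k check over already-computed embeddings. It is an illustration under stated assumptions, not the TIVM implementation: the function names, the use of cosine similarity as the fit measure, the video-to-audio direction of the loss, and the temperature value of 0.1 are all hypothetical choices made here for clarity. The encoders that would produce the embeddings are omitted.

```python
import math

def l2_normalize(vec):
    # Scale a non-zero vector to unit length.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def similarity_matrix(video_embs, audio_embs):
    # Cosine similarity between every video and every audio embedding.
    # Assumption: cosine similarity stands in for the paper's fit measure.
    v = [l2_normalize(e) for e in video_embs]
    a = [l2_normalize(e) for e in audio_embs]
    return [[sum(x * y for x, y in zip(vi, aj)) for aj in a] for vi in v]

def info_nce(video_embs, audio_embs, temperature=0.1):
    # Video-to-audio InfoNCE: row i treats audio i as the positive and
    # all other audio clips in the batch as negatives.
    sims = similarity_matrix(video_embs, audio_embs)
    losses = []
    for i, row in enumerate(sims):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

def recall_at_k(video_embs, audio_embs, k):
    # Fraction of videos whose true audio is among the k most similar clips.
    sims = similarity_matrix(video_embs, audio_embs)
    hits = 0
    for i, row in enumerate(sims):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        hits += i in ranked[:k]
    return hits / len(sims)

# Toy batch: two videos with matching and deliberately swapped audio.
video = [[1.0, 0.0], [0.0, 1.0]]
audio_aligned = [[1.0, 0.0], [0.0, 1.0]]
audio_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(video, audio_aligned)
loss_swapped = info_nce(video, audio_swapped)
```

In this toy batch the aligned pairing yields a much lower loss than the swapped pairing, which is the behavior a contrastive objective of this kind rewards during training. At recommendation time, ranking candidate audio clips by the same similarity gives the top-k list that recall is measured on.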
