Start from Video-Music Retrieval: An Inter-Intra Modal Loss for Cross Modal Retrieval
Zeyu Chen, Pengfei Zhang, Kai Ye, Wei Dong, Xin Feng, Yana Zhang
TL;DR
This work tackles cross-modal video-music retrieval under noisy self-supervised training, where the one-to-one video-music pairs in the dataset do not reflect practice: many unpaired videos and music tracks are in fact compatible. It introduces the inter-intra (II) modal loss, which combines an inter-modal contrastive objective with intra-modal distribution preservation to reduce overfitting to these false negatives. The II-CLVM framework, incorporating II loss and Global Sparse sampling, achieves state-of-the-art retrieval on YouTube8M, and the II-CLVTM variant integrates multi-modal video cues such as in-video text. The results indicate that II loss improves generalization across self-supervised and supervised, uni-modal and cross-modal tasks, and remains effective with limited training data, offering a practical approach to robust cross-modal retrieval.
Abstract
The burgeoning short video industry has accelerated the advancement of video-music retrieval technology, which assists content creators in selecting appropriate music for their videos. In self-supervised training for video-to-music retrieval, the video and music samples in the dataset are extracted from the same original video, so every sample has exactly one match. This does not reflect reality: a video can use different pieces of music as its background, and a piece of music can serve as the background for different videos. Many videos and music tracks that are not paired in the dataset may still be compatible, which introduces false negative noise. A novel inter-intra modal (II) loss is proposed as a solution. By reducing the variation of the feature distribution within each of the two modalities before and after the encoder, II loss reduces the model's overfitting to such noise without requiring costly and laborious removal of the noise. The video-music retrieval framework II-CLVM (Contrastive Learning for Video-Music Retrieval), which incorporates II loss, achieves state-of-the-art performance on the YouTube8M dataset. The framework II-CLVTM shows better performance when retrieving music using multi-modal video information (such as text in videos). Experiments are designed to show that II loss can effectively alleviate the problem of false negative noise in retrieval tasks. Experiments also show that II loss improves various self-supervised and supervised uni-modal and cross-modal retrieval tasks, and can yield good retrieval models from a small number of training samples.
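The abstract gives no formula for II loss, so the following is only a minimal sketch of the idea under stated assumptions, not the authors' implementation. It assumes the inter-modal term is a symmetric InfoNCE loss over cosine similarities, and that "variation of feature distribution" within a modality can be measured as the KL divergence between row-wise softmax distributions of the pairwise similarity matrix before and after the encoder. The function names, the temperature value, and the `weight` parameter are all hypothetical.

```python
import math


def l2_normalize(v):
    # Guard against zero vectors by falling back to a norm of 1.0.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine_sim_matrix(a, b):
    # Entry [i][j] is the cosine similarity between a[i] and b[j].
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]


def softmax(row, temperature):
    # Numerically stable softmax over one row of similarities.
    m = max(row)
    exps = [math.exp((x - m) / temperature) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def inter_modal_loss(video_emb, music_emb, temperature=0.1):
    # Assumed form: symmetric InfoNCE, where the i-th video and the
    # i-th music track are treated as the only positive pair.
    sim = cosine_sim_matrix(video_emb, music_emb)
    sim_t = [list(col) for col in zip(*sim)]
    n = len(sim)
    v2m = -sum(math.log(softmax(sim[i], temperature)[i]) for i in range(n)) / n
    m2v = -sum(math.log(softmax(sim_t[i], temperature)[i]) for i in range(n)) / n
    return 0.5 * (v2m + m2v)


def intra_modal_loss(feat_before, feat_after, temperature=0.1):
    # Assumed form: mean KL divergence between the within-modality
    # similarity distributions before and after the encoder.
    p_rows = [softmax(r, temperature)
              for r in cosine_sim_matrix(feat_before, feat_before)]
    q_rows = [softmax(r, temperature)
              for r in cosine_sim_matrix(feat_after, feat_after)]
    kl = 0.0
    for p, q in zip(p_rows, q_rows):
        kl += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl / len(p_rows)


def ii_loss(video_in, music_in, video_emb, music_emb, weight=1.0):
    # Hypothetical combination: inter-modal term plus a weighted sum of
    # the intra-modal terms for the two modalities.
    inter = inter_modal_loss(video_emb, music_emb)
    intra = (intra_modal_loss(video_in, video_emb)
             + intra_modal_loss(music_in, music_emb))
    return inter + weight * intra
```

Here `video_in` and `music_in` stand for the features fed to each encoder and `video_emb` and `music_emb` for the encoder outputs, each given as a list of equal-length float vectors for one batch. In this sketch the intra-modal term is zero when an encoder leaves the pairwise similarity structure unchanged, so it penalizes only how far the contrastive objective reshapes each modality's neighborhood structure. That matches the stated motivation of limiting overfitting to false negatives without removing them.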
