Start from Video-Music Retrieval: An Inter-Intra Modal Loss for Cross Modal Retrieval

Zeyu Chen, Pengfei Zhang, Kai Ye, Wei Dong, Xin Feng, Yana Zhang

TL;DR

This work tackles cross-modal video-music retrieval under noisy self-supervised training, where the dataset treats video-music pairs as strictly one-to-one even though they are not in practice. It introduces the inter-intra (II) modal loss, which combines an inter-modal contrastive objective with intra-modal distribution preservation to reduce overfitting to false negatives. The II-CLVM framework, incorporating II loss and Global Sparse sampling, achieves state-of-the-art retrieval on YouTube8M, and its extension II-CLVTM integrates multi-modal cues. The results indicate that II loss improves generalization across self-supervised and supervised, uni-modal and cross-modal tasks, and remains effective with limited training data.

Abstract

The burgeoning short video industry has accelerated the advancement of video-music retrieval technology, assisting content creators in selecting appropriate music for their videos. In self-supervised training for video-to-music retrieval, the video and music samples in the dataset are extracted from the same source video, so they are all one-to-one matches. This does not match the real situation. In reality, a video can use different pieces of music as background music, and a piece of music can serve as background music for different videos. Many videos and pieces of music that are not paired may still be compatible, which introduces false negative noise into the dataset. A novel inter-intra modal (II) loss is proposed as a solution. By reducing the variation of the feature distribution within each of the two modalities before and after the encoder, II loss can reduce the model's overfitting to such noise without removing it in a costly and laborious way. The video-music retrieval framework, II-CLVM (Contrastive Learning for Video-Music Retrieval), incorporating the II loss, achieves state-of-the-art performance on the YouTube8M dataset. The framework II-CLVTM shows better performance when retrieving music using multi-modal video information (such as text in videos). Experiments are designed to show that II loss can effectively alleviate the problem of false negative noise in retrieval tasks. Experiments also show that II loss improves various self-supervised and supervised uni-modal and cross-modal retrieval tasks, and can yield good retrieval models from a small number of training samples.
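
The idea of combining an inter-modal objective with an intra-modal constraint can be sketched as follows. This is a minimal illustration, not the authors' implementation: the exact form of each term is not given in this section, so the sketch assumes a symmetric InfoNCE loss for the inter-modal part and a mean squared difference between cosine-similarity matrices for the intra-modal part. The function names, the `temperature` value, and the `gamma_intra` weight are hypothetical. The matrix names follow the Figure 1 caption ($S$, $S_{v}$, $S_{v'}$, $S_{m}$, $S_{m'}$), and each input is assumed to be one pooled feature vector per sample.

```python
import numpy as np


def cosine_similarity_matrix(x, y):
    """Pairwise cosine similarity between the rows of x (n, d) and y (k, d)."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return x @ y.T


def log_softmax(z, axis):
    """Numerically stable log-softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def inter_modal_loss(s, temperature=0.07):
    """Assumed form: symmetric InfoNCE over S, with matched pairs on the diagonal."""
    logits = s / temperature
    idx = np.arange(s.shape[0])
    video_to_music = -log_softmax(logits, axis=1)[idx, idx].mean()
    music_to_video = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (video_to_music + music_to_video)


def intra_modal_loss(s_before, s_after):
    """Assumed form: mean squared change of within-modality similarities."""
    return float(np.mean((s_before - s_after) ** 2))


def ii_loss(v, m, v_enc, m_enc, gamma_intra=1.0):
    """Inter-modal loss plus weighted intra-modal losses for both modalities.

    v, m         : pretrained (pre-encoder) video and music features, one row per sample
    v_enc, m_enc : encoded features in the shared space, one row per sample
    """
    s = cosine_similarity_matrix(v_enc, m_enc)  # S
    s_v = cosine_similarity_matrix(v, v)  # S_v
    s_v_enc = cosine_similarity_matrix(v_enc, v_enc)  # S_v'
    s_m = cosine_similarity_matrix(m, m)  # S_m
    s_m_enc = cosine_similarity_matrix(m_enc, m_enc)  # S_m'
    intra = intra_modal_loss(s_v, s_v_enc) + intra_modal_loss(s_m, s_m_enc)
    return inter_modal_loss(s) + gamma_intra * intra


rng = np.random.default_rng(0)
v, m = rng.normal(size=(4, 8)), rng.normal(size=(4, 6))
v_enc, m_enc = rng.normal(size=(4, 5)), rng.normal(size=(4, 5))
total = ii_loss(v, m, v_enc, m_enc, gamma_intra=0.5)
```

Under these assumptions, the intra-modal term is zero whenever the encoder leaves the within-batch similarity structure unchanged, so it only penalizes the encoder for reshaping distances between samples of the same modality. That is the mechanism the abstract credits with limiting how far false negative pairs are pushed apart.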

Paper Structure (22 sections, 10 equations, 7 figures, 8 tables)

This paper contains 22 sections, 10 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: The structure of II-CLVM. The global sparse (GS) sampling method is applied to each video and music sample to extract the pretrained feature sequences $v_{i}$ and $m_{j}$. The encoded features $v_{i}'$ and $m_{j}'$ are then obtained by the video and music encoders, respectively. Then, the inter-modal similarity matrix $S$ and the intra-modal similarity matrices $S_{v}$, $S_{v'}$, $S_{m}$, $S_{m'}$ are calculated. The inter-modal loss is calculated from the matrix $S$, and the intra-modal losses for the video and music modalities are calculated from $S_{v}$ and $S_{v'}$, and from $S_{m}$ and $S_{m'}$, respectively.
  • Figure 2: Music feature distribution before and after the encoder. The dotted line indicates the feature distribution within a batch. The solid arrows represent the direction in which the inter-modal loss acts on the features. The gray arrow increases the feature distance and the blue arrow decreases it. The intra-modal loss prevents the distance between false negative sample pairs ($v_2$ and $m_4$) from growing by maintaining the distribution of the pretrained features.
  • Figure 3: The general structure of the II-CLVTM framework. Feature $V_{T}'$ is obtained by an encoder that fuses the video feature sequence $V$ and the text feature sequence $T$. Feature $M$ is processed by another encoder to obtain feature $M'$. The inter-modal similarity matrix $S$ is calculated from $V_{T}'$ and $M'$. The intra-modal similarity matrices $S_{\bar{V}_T}$, $S_{V_{T}'}$, $S_{\bar{M}}$ and $S_{M'}$ are obtained from $\bar{V}_T$ (concatenated from $T$ and $\bar{V}$), $V_T'$, $\bar{M}$ and $M'$, respectively. The inter loss is calculated from $S$. The intra losses are calculated from $S_{\bar{V}_T}$ and $S_{V_{T}'}$, and from $S_{\bar{M}}$ and $S_{M'}$, respectively.
  • Figure 4: $R@1$ as a function of the intra loss weight $\gamma_2$.
  • Figure 5: The variation curves of the training inter loss, the training intra loss, and the testing $R@1$ over epochs, when training with the inter loss only and when training with the II loss, respectively.
  • ...and 2 more figures
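
The $R@1$ metric reported in Figures 4 and 5 is a Recall@K score. A minimal sketch of how such a score can be computed from a query-by-candidate similarity matrix is given below. The evaluation protocol used in the paper (candidate pool size, tie handling) is not described in this section, so the sketch assumes that the correct candidate for query `i` sits at column `i` and that ties count in favor of the correct item. The function name `recall_at_k` is hypothetical.

```python
import numpy as np


def recall_at_k(similarity, k=1):
    """Fraction of queries whose matching candidate ranks within the top k.

    similarity[i, j] is the score between query i (e.g. a video) and
    candidate j (e.g. a music track); the match for query i is assumed
    to be candidate i.
    """
    n = similarity.shape[0]
    true_scores = similarity[np.arange(n), np.arange(n)][:, None]
    # Zero-based rank: number of candidates scoring strictly higher than the match.
    ranks = (similarity > true_scores).sum(axis=1)
    return float((ranks < k).mean())


# Toy example: queries 0 and 2 rank their match first, query 1 ranks it second.
sim = np.array([
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.1, 0.7],
])
r1 = recall_at_k(sim, k=1)  # 2 of 3 queries hit at k=1
r2 = recall_at_k(sim, k=2)  # all 3 queries hit at k=2
```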