MDMMT: Multidomain Multimodal Transformer for Video Retrieval

Maksim Dzabraev, Maksim Kalashnikov, Stepan Komkov, Aleksandr Petiushko

TL;DR

The paper tackles text-to-video retrieval by building a multidomain multimodal transformer (MDMMT) that generalizes across multiple video-caption datasets without task-specific finetuning. It extends the baseline MMT architecture with stronger motion-expert features, deeper/wider transformers, and a principled multidataset training regime, coupled with a two-stage overlap-cleaning process to prevent train-test leakage. Key contributions include achieving state-of-the-art results on MSRVTT and LSMDC, showing strong performance on ActivityNet in video-to-text mode, and demonstrating the benefits of combining diverse datasets with carefully tuned sampling weights. The work highlights the practical importance of cross-dataset generalization, provides a framework for intersection analysis, and offers a scalable approach for robust video search in real-world, multi-source data environments.

Abstract

We present a new state of the art on the text-to-video retrieval task on the MSRVTT and LSMDC benchmarks, where our model outperforms all previous solutions by a large margin. Moreover, these state-of-the-art results are achieved with a single model on both datasets without finetuning. This multidomain generalisation is achieved by a proper combination of different video caption datasets. We show that training on several datasets jointly can improve the test results on each of them. Additionally, we check the intersection between many popular datasets and find that MSRVTT has a significant overlap between its test and train parts; the same situation is observed for ActivityNet.
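
The "combination of different video caption datasets" mentioned above can be pictured as weighted sampling across datasets when a training batch is built. The snippet below is only a minimal sketch of that general idea, not the authors' training code: the function name `sample_mixed_batch`, the toy clip identifiers, and the weight values are all hypothetical and chosen purely for illustration.

```python
import random


def sample_mixed_batch(datasets, weights, batch_size, rng):
    """Build one batch from several datasets.

    For each slot in the batch, first pick a dataset with probability
    proportional to its weight, then pick one item uniformly from it.
    """
    names = list(datasets)
    probs = [weights[name] for name in names]
    chosen = rng.choices(names, weights=probs, k=batch_size)
    return [(name, rng.choice(datasets[name])) for name in chosen]


# Toy stand-ins for (video, caption) pairs; real datasets would hold features.
toy_datasets = {
    "MSRVTT": ["msrvtt_clip_0", "msrvtt_clip_1", "msrvtt_clip_2"],
    "LSMDC": ["lsmdc_clip_0", "lsmdc_clip_1"],
    "ActivityNet": ["anet_clip_0", "anet_clip_1", "anet_clip_2", "anet_clip_3"],
}

# Hypothetical sampling weights, NOT the values used in the paper.
toy_weights = {"MSRVTT": 2.0, "LSMDC": 1.0, "ActivityNet": 1.0}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
batch = sample_mixed_batch(toy_datasets, toy_weights, batch_size=8, rng=rng)
for source, item in batch:
    print(source, item)
```

Sampling the dataset first and the item second decouples how often a dataset is seen from how large it is, which is one simple way to keep a small dataset from being drowned out by a large one.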

Paper Structure

This paper contains 28 sections, 5 equations, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Two types of fusion
  • Figure 2: The radius of each ball represents the "information size" of a dataset. Larger balls correspond to more diverse data.
  • Figure 3: R@5 on the MSRVTT full clean split increases as the train part is enriched.
  • Figure 4: R@5 on the ActivityNet test set increases as the train part is enriched.
  • Figure 5: R@5 on the LSMDC test set increases as the train part is enriched.
  • ...and 3 more figures
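
The R@5 metric named in the figure captions is Recall at K with K = 5: the fraction of text queries whose correct video appears among the top K retrieved results. The snippet below is a minimal sketch of that computation under simplifying assumptions that are not taken from the paper: each query has exactly one correct video, query i is paired with video i, and ties are resolved in favour of the correct video. The function name `recall_at_k` and the toy similarity scores are hypothetical.

```python
def recall_at_k(sim, k):
    """Recall@K for text-to-video retrieval.

    sim[i][j] is the similarity between text query i and video j.
    Assumption: the correct video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank of the correct video: how many videos score
        # strictly higher (ties therefore favour the correct video).
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(sim)


# Toy similarity matrix: 3 queries x 3 videos.
toy_sim = [
    [0.9, 0.1, 0.3],  # correct video ranked first
    [0.8, 0.5, 0.6],  # correct video ranked third
    [0.2, 0.7, 0.4],  # correct video ranked second
]

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(toy_sim, k):.3f}")
```

On this toy matrix one query out of three is answered correctly at rank 1, two out of three within the top 2, and all three within the top 3.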