MDMMT: Multidomain Multimodal Transformer for Video Retrieval
Maksim Dzabraev, Maksim Kalashnikov, Stepan Komkov, Aleksandr Petiushko
TL;DR
The paper tackles text-to-video retrieval by building a multidomain multimodal transformer (MDMMT) that generalizes across multiple video-caption datasets without task-specific finetuning. It extends the baseline MMT architecture with stronger motion-expert features, deeper/wider transformers, and a principled multidataset training regime, coupled with a two-stage overlap-cleaning process to prevent train-test leakage. Key contributions include achieving state-of-the-art results on MSRVTT and LSMDC, showing strong performance on ActivityNet in video-to-text mode, and demonstrating the benefits of combining diverse datasets with carefully tuned sampling weights. The work highlights the practical importance of cross-dataset generalization, provides a framework for intersection analysis, and offers a scalable approach for robust video search in real-world, multi-source data environments.
Abstract
We present a new state of the art on the text-to-video retrieval task on the MSRVTT and LSMDC benchmarks, where our model outperforms all previous solutions by a large margin. Moreover, these state-of-the-art results are achieved with a single model on both datasets without finetuning. This multidomain generalisation is achieved by a proper combination of different video caption datasets. We show that training on several datasets jointly can improve the test results on each of them. Additionally, we check the intersection between many popular datasets and find that MSRVTT has a significant overlap between its test and train parts; the same situation is observed for ActivityNet.
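
One way to picture the "combination of different video caption datasets" is as weighted sampling over sources: each training example is drawn by first choosing a dataset according to a weight, then choosing a video-caption pair from it. The sketch below is only an illustration under stated assumptions. The function name `make_multidataset_sampler`, the toy captions, and the weight values are hypothetical and are not taken from the paper, which may combine datasets differently.

```python
import itertools
import random
from collections import Counter


def make_multidataset_sampler(datasets, weights, seed=0):
    """Yield (dataset_name, example) pairs forever.

    The source dataset is chosen with probability proportional to its
    weight; the example is then chosen uniformly within that dataset.
    """
    names = list(datasets)
    probs = [weights[name] for name in names]
    rng = random.Random(seed)
    while True:
        name = rng.choices(names, weights=probs, k=1)[0]
        yield name, rng.choice(datasets[name])


# Toy stand-ins for (video_id, caption) pairs; real datasets are far larger.
datasets = {
    "MSRVTT": [("msrvtt_0", "a man is cooking"), ("msrvtt_1", "a dog runs")],
    "ActivityNet": [("anet_0", "a woman climbs a wall")],
    "LSMDC": [("lsmdc_0", "someone opens the door")],
}

# Hypothetical sampling weights, chosen only for illustration.
weights = {"MSRVTT": 0.6, "ActivityNet": 0.3, "LSMDC": 0.1}

sampler = make_multidataset_sampler(datasets, weights, seed=0)
draws = list(itertools.islice(sampler, 2000))
counts = Counter(name for name, _ in draws)
```

Here `counts` records how often each source was picked, so its proportions should roughly follow the chosen weights. In practice such weights would be treated as hyperparameters and tuned on validation data.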
