SimVTP: Simple Video Text Pre-training with Masked Autoencoders

Yue Ma, Tianyu Yang, Yin Shan, Xiu Li

TL;DR

SimVTP introduces a unified masked autoencoder framework for video-text pretraining, masking both video tubes and text tokens and reconstructing them with a shared Transformer encoder and separate decoders. It emphasizes high masking ratios and optional cross-modal losses (VTC, VTM) to learn cross-modal alignment efficiently, achieving strong results with limited data on MSRVTT and other tasks. Key contributions include the simple architecture, the demonstrated data efficiency on WebVid-2M, and extensive ablations illustrating the impact of masking strategies and cross-modal training. The approach advances transferable multi-modal representations for retrieval, VQA, and grounding without relying on detectors or region-based modules.
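The masking step can be illustrated with a minimal sketch, not the authors' implementation. It uses the 90% video and 75% text mask ratios the paper reports. The clip size (16 frames at 224x224), the 2x16x16 tube size, the whitespace tokenizer, the example caption, and the `random_mask` helper are all assumptions made for illustration, as is replacing masked words with a `[MASK]` placeholder.

```python
import random


def random_mask(num_items, mask_ratio, rng):
    """Split indices 0..num_items-1 into (visible, masked) at the given ratio."""
    num_masked = round(num_items * mask_ratio)
    order = list(range(num_items))
    rng.shuffle(order)
    return sorted(order[num_masked:]), sorted(order[:num_masked])


rng = random.Random(0)

# Assumed input: 16 frames at 224x224, split into 2x16x16 tubes.
num_tubes = (16 // 2) * (224 // 16) * (224 // 16)  # 8 * 14 * 14 = 1568
visible_tubes, masked_tubes = random_mask(num_tubes, 0.90, rng)

# Hypothetical caption, whitespace-tokenized for illustration only.
tokens = "a man is playing guitar on the stage".split()
visible_words, masked_words = random_mask(len(tokens), 0.75, rng)
masked_set = set(masked_words)
masked_caption = ["[MASK]" if i in masked_set else tok
                  for i, tok in enumerate(tokens)]

print(len(visible_tubes), "of", num_tubes, "tubes stay visible")
print(" ".join(masked_caption))
```

Under these assumed sizes, only 157 of 1,568 tubes and 2 of 8 words remain visible, which gives a sense of how little of either modality is left to reconstruct from.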

Abstract

This paper presents SimVTP: a Simple Video-Text Pretraining framework via masked autoencoders. We randomly mask out the spatial-temporal tubes of the input video and the word tokens of the input text, then feed them into a unified autoencoder to reconstruct the missing pixels and words. SimVTP has several properties: 1) Thanks to the unified autoencoder, SimVTP reconstructs the masked signal of one modality with help from the other modality, which implicitly learns the cross-modal alignment between video tubes and text tokens. 2) SimVTP not only benefits from a high video masking ratio (e.g. 90%) due to the temporal redundancy of video, but also needs a high text masking ratio (e.g. 75%), much higher than BERT's (e.g. 15%), to achieve optimal performance. This is because the video modality makes text reconstruction less challenging, so a higher mask ratio is needed to make the pretext task hard enough for useful feature learning. 3) Equipping SimVTP with video-text contrastive learning (VTC) and video-text matching (VTM), two commonly used cross-modal training strategies, further improves transfer performance significantly. 4) SimVTP is data-efficient: pre-trained on only 10% of WebVid-2M, it achieves surprisingly good results (43.8 R@1) on MSRVTT, far above recent state-of-the-art methods pre-trained on both CC3M and WebVid-2M. We transfer our pre-trained model to various downstream tasks and achieve superior performance. Code and models will be released at https://github.com/mayuelala/SimVTP.
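Video-text contrastive learning (VTC) is commonly implemented as a symmetric InfoNCE loss over a batch similarity matrix. The sketch below shows that common formulation in plain Python. It is an assumption that SimVTP uses exactly this form; the temperature of 0.07, the function names, and the toy matrices are illustrative and not taken from the paper.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def vtc_loss(sim, temperature=0.07):
    """Symmetric contrastive loss.

    sim[i][j] is the similarity of video i and text j; pair (i, i) is
    the positive. The temperature value is an assumed default.
    """
    n = len(sim)
    scaled = [[s / temperature for s in row] for row in sim]
    # Video-to-text: each row is a softmax over candidate texts.
    v2t = -sum(log_softmax(scaled[i])[i] for i in range(n)) / n
    # Text-to-video: each column is a softmax over candidate videos.
    cols = [[scaled[i][j] for i in range(n)] for j in range(n)]
    t2v = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (v2t + t2v)


aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shuffled = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
print(vtc_loss(aligned) < vtc_loss(shuffled))
```

The loss is low when each video is most similar to its own caption and high when the pairing is scrambled, which is the alignment signal the contrastive objective adds on top of reconstruction.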

Paper Structure

This paper contains 9 sections, 1 equation, 5 figures, 5 tables.

Figures (5)

  • Figure 1: SimVTP is data-efficient. With only 1/10 of WebVid-2M as training data, SimVTP achieves 43.8% R@1 on MSRVTT, outperforming most recent state-of-the-art methods.
  • Figure 2: The SimVTP architecture. After randomly masking out video tubes and text tokens at an extremely high mask ratio, SimVTP applies a unified encoder and two separate decoders to reconstruct the missing video and text. The unified encoder lets the model learn cross-modal correspondence through the Transformer's attention blocks, which benefits feature learning.
  • Figure 3: The effect of mask ratio. Performance improves gradually as the text mask ratio increases and peaks at 75% (top). A video mask ratio of 90% works best across different text mask ratios (bottom).
  • Figure 4: Video-text reconstruction on WebVid-2M using a SimVTP model pre-trained with a video mask ratio of 90% and a text mask ratio of 75%. From top to bottom, we show original frames, masked frames, reconstructed frames, original text, masked text and reconstructed text.
  • Figure 5: Cross-attention weights from the unified encoder learned by SimVTP. The reference word token is marked in red. The words in the top row are nouns and those in the bottom row are verbs. Both types of words attend to the visual content well.
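The R@1 number quoted in Figure 1 is a retrieval recall: the fraction of text queries whose ground-truth video is ranked first. A minimal sketch of the standard recall@k computation follows; the `recall_at_k` name and the toy similarity matrix are illustrative assumptions, not the paper's evaluation code.

```python
def recall_at_k(sim, k=1):
    """Text-to-video recall@k.

    sim[i][j] scores text query i against video j; the ground-truth
    video for query i is assumed to sit at index i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank candidate videos from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)


# Toy scores: queries 0 and 1 retrieve their own video first, query 2 does not.
sim = [
    [0.9, 0.1, 0.2],
    [0.2, 0.8, 0.3],
    [0.1, 0.7, 0.6],
]
print(recall_at_k(sim, k=1))
print(recall_at_k(sim, k=2))
```

On this toy matrix two of three queries succeed at k=1, and all three succeed once the top two candidates are allowed.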