SNP-S3: Shared Network Pre-training and Significant Semantic Strengthening for Various Video-Text Tasks

Xingning Dong, Qingpei Guo, Tian Gan, Qing Wang, Jianlong Wu, Xiangyuan Ren, Yuan Cheng, Wei Chu

TL;DR

This work addresses the efficiency–performance trade-off in pixel-level video-text pre-training by introducing Shared Network Pre-training (SNP), a lightweight framework that processes both textual and cross-modal inputs with a single shared BERT-type network. Complementing SNP, the Significant Semantic Strengthening (S3) strategy provides two novel proxy tasks: Masked Significant Semantic Modeling (MSSM), which emphasizes informative words, and Local Vision-Word Matching (LVWM), which improves word-level cross-modal alignment. The model is pre-trained on image-text data, with additional results reported for video-text pre-training, and achieves new state-of-the-art results on three downstream video-text tasks (TVR, VQA, MC-VQA) across six datasets, while reducing parameter count and improving training efficiency. Overall, SNP-S3 delivers cross-modal video-text representations suitable for diverse applications, and its code is open source.

Abstract

We present a framework for learning cross-modal video representations by directly pre-training on raw data to facilitate various downstream video-text tasks. Our main contributions lie in the pre-training framework and proxy tasks. First, motivated by the shortcomings of the two mainstream pixel-level pre-training architectures (limited applicability or low efficiency), we propose Shared Network Pre-training (SNP). By employing one shared BERT-type network to refine textual and cross-modal features simultaneously, SNP is lightweight and can support various downstream applications. Second, based on the intuition that people pay attention to a few "significant words" when understanding a sentence, we propose the Significant Semantic Strengthening (S3) strategy, which includes a novel masking proxy task and a novel matching proxy task to improve pre-training performance. Experiments on three downstream video-text tasks and six datasets demonstrate that we establish a new state of the art in pixel-level video-text pre-training and achieve a satisfactory balance between pre-training efficiency and fine-tuning performance. The codebase is available at https://github.com/alipay/Ant-Multi-Modal-Framework/tree/main/prj/snps3_vtp.
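
The "significant words" intuition behind the masking proxy task can be illustrated with a small sketch. The abstract does not state how significance is determined, so the stop-word filter below is purely an assumption for illustration, and the names `STOP_WORDS`, `significant_positions`, and `mask_significant` are hypothetical rather than taken from the released codebase.

```python
import random

# Assumption: a word counts as "significant" if it is not a stop word.
# The paper's actual selection criterion is not given in the abstract.
STOP_WORDS = {"a", "an", "the", "is", "are", "on", "in", "of", "and", "to", "with"}
MASK = "[MASK]"


def significant_positions(tokens):
    """Return indices of tokens treated as significant under the assumption above."""
    return [i for i, tok in enumerate(tokens) if tok.lower() not in STOP_WORDS]


def mask_significant(tokens, ratio=0.3, seed=0):
    """Mask a fraction of the significant tokens only.

    Returns the masked token list and a dict {position: original token}
    that a model would be trained to restore.
    """
    rng = random.Random(seed)
    candidates = significant_positions(tokens)
    if not candidates:
        return list(tokens), {}
    k = max(1, round(len(candidates) * ratio))
    chosen = sorted(rng.sample(candidates, k))
    masked = list(tokens)
    targets = {}
    for i in chosen:
        targets[i] = masked[i]
        masked[i] = MASK
    return masked, targets


if __name__ == "__main__":
    caption = "a man is riding a horse on the beach".split()
    masked, targets = mask_significant(caption, ratio=0.5, seed=1)
    print(masked)
    print(targets)
```

Unlike uniform random masking, this sketch never spends a mask on words such as "a" or "the", which can often be guessed without looking at the visual input.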

Paper Structure

This paper contains 25 sections, 9 equations, 7 figures, 5 tables, 2 algorithms.

Figures (7)

  • Figure 1: Comparison of mainstream pixel-level pre-training architectures: a) Twin-tower-based, b) Three-fusion-based, and c) the proposed Shared Network Pre-training (SNP) methods.
  • Figure 2: Comparison of two widely-employed masking and matching proxy tasks (MLM-1 and GVTM-2) and our improved version (MSSM-3 and LVWM-4).
  • Figure 3: The framework of SNP-$\textbf{S}^\textbf{3}$, which employs a shared BERT-type encoder to process textual and cross-modal features. Following previous work, we 1) pre-train on image-text datasets, and 2) fine-tune on downstream video-text tasks. We also report the results pre-trained on video-text datasets in Section \ref{sec:performance_compare}.
  • Figure 4: Three types of proxy tasks for pre-training the proposed SNP-$\textbf{S}^\textbf{3}$. Notably, we propose two improved tasks, marked in red (MSSM and LVWM), to facilitate cross-modal interaction. Specifically, MSSM masks out significant, informative words and forces the model to restore them, while LVWM learns the video-text alignment at the word level.
  • Figure 5: Details of fine-tuning the pre-trained model on three downstream video-text tasks.
  • ...and 2 more figures
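
The word-level alignment that the Figure 4 caption attributes to LVWM can be sketched as follows. The caption does not give the scoring formula, so this is an assumed formulation (each word is matched to its most similar visual feature by cosine similarity, and the per-word maxima are averaged), not the paper's definition. The function names and the toy 2-D vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def word_level_score(word_vecs, visual_vecs):
    """Assumed word-level matching score.

    For every word embedding, take its best cosine match among the visual
    embeddings, then average these maxima over all words.
    """
    if not word_vecs or not visual_vecs:
        raise ValueError("both inputs must be non-empty")
    best_per_word = [max(cosine(w, v) for v in visual_vecs) for w in word_vecs]
    return sum(best_per_word) / len(best_per_word)


if __name__ == "__main__":
    # Toy embeddings: two words, each pointing along a different axis.
    words = [[1.0, 0.0], [0.0, 1.0]]
    video_matching = [[1.0, 0.0], [0.0, 1.0]]   # covers both words
    video_partial = [[1.0, 0.0], [1.0, 0.0]]    # covers only the first word
    print(word_level_score(words, video_matching))
    print(word_level_score(words, video_partial))
```

A sentence-level (global) score would compare one pooled text vector with one pooled video vector. Scoring per word instead penalizes a video that matches the sentence overall but has no visual evidence for a particular word.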