Cluster-Wise Spatio-Temporal Masking for Efficient Video-Language Pretraining

Weijun Zhuang, Yuqing Huang, Weikang Meng, Xin Li, Ming Liu, Xiaopeng Hong, Yaowei Wang, Wangmeng Zuo

Abstract

Large-scale video-language pretraining enables strong generalization across multimodal tasks but often incurs prohibitive computational costs. Although recent advances in masked visual modeling help mitigate this issue, they still suffer from two fundamental limitations: severe visual information loss under high masking ratios and temporal information leakage caused by inter-frame correlations. To address these challenges, we propose ClusterSTM, a Cluster-Wise Spatio-Temporal Masking strategy for efficient video-language pretraining. ClusterSTM first performs intra-frame clustering to partition visual tokens into multiple semantically independent clusters, then conducts cluster-wise masking by retaining the token with the highest temporal density within each cluster. Our masking strategy ensures that the retained tokens capture holistic video content while exhibiting strong temporal correlation. Additionally, we introduce a video-text relevance reconstruction objective that aligns high-level multimodal semantics beyond conventional visual reconstruction. Extensive experiments across multiple benchmarks demonstrate that ClusterSTM achieves superior performance on video-text retrieval, video question answering, and video captioning tasks, establishing a new state of the art among efficient video-language models.
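The two masking steps described in the abstract (intra-frame clustering, then keeping one token per cluster) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract defines neither the clustering algorithm nor "temporal density", so the sketch assumes plain k-means for clustering and, as a stand-in density, the average best cosine match between a token and the tokens of its adjacent frames. The function names `kmeans_labels`, `temporal_density`, and `cluster_wise_mask` are hypothetical.

```python
import numpy as np


def kmeans_labels(x, k, iters=10, seed=0):
    """Plain k-means on the rows of x (assumed clustering method)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Assign each token to its nearest center (squared Euclidean distance).
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for c in range(k):
            members = x[labels == c]
            if len(members):  # keep the old center if a cluster is empty
                centers[c] = members.mean(0)
    return labels


def temporal_density(tokens, eps=1e-8):
    """Stand-in density (an assumption): for each token, the mean over
    adjacent frames of its best cosine similarity to that frame's tokens.
    tokens has shape (T, N, D); the result has shape (T, N)."""
    unit = tokens / (np.linalg.norm(tokens, axis=-1, keepdims=True) + eps)
    num_frames = len(unit)
    density = np.zeros(unit.shape[:2])
    for t in range(num_frames):
        neighbors = [s for s in (t - 1, t + 1) if 0 <= s < num_frames]
        for s in neighbors:
            sim = unit[t] @ unit[s].T  # (N, N) cosine similarities
            density[t] += sim.max(1)
        density[t] /= max(len(neighbors), 1)
    return density


def cluster_wise_mask(tokens, k):
    """For every frame, keep the highest-density token of each cluster.
    Returns a list with one sorted list of kept token indices per frame."""
    density = temporal_density(tokens)
    keep = []
    for t, frame in enumerate(tokens):
        labels = kmeans_labels(frame, k)
        kept = []
        for c in np.unique(labels):
            members = np.flatnonzero(labels == c)
            kept.append(int(members[density[t, members].argmax()]))
        keep.append(sorted(kept))
    return keep
```

With `tokens` of shape `(T, N, D)` and `k` clusters, at most `k` of the `N` tokens survive in each frame, so the masking ratio is at least `1 - k / N`. Because every cluster contributes a token, each semantic region of the frame stays represented even at high masking ratios.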

Paper Structure (11 sections, 5 equations, 4 figures, 9 tables)

Figures (4)

  • Figure 1: Comparison of different masking strategies. By selecting the token with the highest temporal density within each cluster, the cluster-wise spatio-temporal masking ensures strong temporal correlation among the retained tokens, thereby effectively mitigating the issue of temporal information leakage.
  • Figure 2: The token with the highest temporal density in a frame typically maintains the highest temporal density in subsequent frames, regardless of its spatial displacement, thereby ensuring that the retained tokens exhibit strong temporal correlation.
  • Figure 3: The ClusterSTM pipeline consists of two main components: the Cluster-wise Spatio-Temporal Masking strategy and the Video-Text Relevance Generation process. The Cluster-wise Spatio-Temporal Masking strategy first performs intra-frame clustering, followed by Temporal-Density-based Cluster-wise Masking. In this way, the retained tokens not only comprehensively capture the holistic content of each frame but also exhibit strong temporal semantic consistency. The Video-Text Relevance Generation process then produces fine-grained video-text relevance matrices, which serve as reconstruction targets for MRM loss computation.
  • Figure 4: A schematic illustration of the Video-Text Relevance Generation process. The module first aggregates each target token with its neighboring tokens through a pooling operator to obtain an enhanced token. This enhanced token is then multiplied with the text feature to produce a high-quality video-text relevance matrix.
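The relevance generation step in the Figure 4 caption (pool each token with its neighbors, then multiply by the text feature) might look roughly like the sketch below. Several details are assumptions rather than facts from the caption: tokens are taken to lie on an `H × W` grid, the pooling operator is a 3×3 average with replicated borders, no normalization is applied, and the relevance is computed for every token rather than only for the target tokens. The function name `video_text_relevance` is hypothetical.

```python
import numpy as np


def video_text_relevance(frame_tokens, text_feats, window=3):
    """frame_tokens: (H, W, D) grid of visual tokens; text_feats: (L, D).
    Returns an (H*W, L) video-text relevance matrix.
    Average pooling and an odd window size are assumptions of this sketch."""
    height, width, dim = frame_tokens.shape
    pad = window // 2
    # Replicate border tokens so every position has a full neighborhood.
    padded = np.pad(frame_tokens, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    enhanced = np.zeros_like(frame_tokens, dtype=float)
    for dy in range(window):
        for dx in range(window):
            enhanced += padded[dy:dy + height, dx:dx + width]
    enhanced /= window * window  # mean over the window
    # Dot product of each enhanced token with each text feature.
    return enhanced.reshape(height * width, dim) @ text_feats.T
```

Each row of the result scores one visual token against every text feature. According to the Figure 3 caption, matrices of this kind serve as reconstruction targets for the MRM loss.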