SIGMA: Sinkhorn-Guided Masked Video Modeling

Mohammadreza Salehi, Michael Dorkenwald, Fida Mohammad Thoker, Efstratios Gavves, Cees G. M. Snoek, Yuki M. Asano

TL;DR

SIGMA tackles the semantic gap in masked video modeling by jointly learning a target feature space with a projection network and enforcing a high-entropy, cluster-based structure over space-time tubes using entropy-regularized optimal transport via Sinkhorn. The method introduces a learnable prototype set $K$ and a symmetric cross-prediction loss that compels the video encoder and projection network to predict each other’s cluster assignments, avoiding trivial collapse. Across ten datasets and three benchmarks, SIGMA achieves state-of-the-art performance in linear evaluation, full finetuning, unsupervised video object segmentation, and SEVERE generalization, demonstrating enhanced temporal and spatial semantics and robustness. The approach also accommodates different projection networks (e.g., MLP or DINO) and does not rely on heavy augmentations, making it scalable and versatile for large-scale video pretraining.

Abstract

Video-based pretraining offers immense potential for learning strong visual representations on an unprecedented scale. Recently, masked video modeling methods have shown promising scalability, yet fall short in capturing higher-level semantics due to reconstructing predefined low-level targets such as pixels. To tackle this, we present Sinkhorn-guided Masked Video Modelling (SIGMA), a novel video pretraining method that jointly learns the video model in addition to a target feature space using a projection network. However, this simple modification means that the regular L2 reconstruction loss will lead to trivial solutions as both networks are jointly optimized. As a solution, we distribute features of space-time tubes evenly across a limited number of learnable clusters. By posing this as an optimal transport problem, we enforce high entropy in the generated features across the batch, infusing semantic and temporal meaning into the feature space. The resulting cluster assignments are used as targets for a symmetric prediction task where the video model predicts cluster assignment of the projection network and vice versa. Experimental results on ten datasets across three benchmarks validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations, improving upon state-of-the-art methods. Our project website with code is available at: https://quva-lab.github.io/SIGMA.
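
To make the equipartition idea concrete, the following is a minimal NumPy sketch of Sinkhorn-style balanced assignment. It is an illustration under stated assumptions, not the authors' implementation: the function name `sinkhorn_assign`, the parameters `epsilon` and `n_iters`, and the random stand-in features and prototypes are all hypothetical. It alternates between giving every prototype the same total mass and renormalizing each tube's assignment into a probability distribution.

```python
import numpy as np

def sinkhorn_assign(scores, epsilon=0.05, n_iters=3):
    """Turn (N, K) feature-to-prototype scores into balanced soft assignments.

    Each returned row sums to 1 (one distribution per space-time tube), and
    the iterations push every column towards a total mass of N / K.
    epsilon and n_iters are illustrative defaults, not values from the paper.
    """
    n, k = scores.shape
    # Subtract the global max before exponentiating for numerical stability.
    q = np.exp((scores - scores.max()) / epsilon)
    for _ in range(n_iters):
        # Column step: give every prototype the same total mass N / K.
        q *= (n / k) / q.sum(axis=0, keepdims=True)
        # Row step: make each tube's assignment a probability distribution.
        q /= q.sum(axis=1, keepdims=True)
    return q

# Hypothetical stand-ins for tube features and learnable prototypes.
rng = np.random.default_rng(0)
features = rng.normal(size=(64, 16))
features /= np.linalg.norm(features, axis=1, keepdims=True)
prototypes = rng.normal(size=(8, 16))
prototypes /= np.linalg.norm(prototypes, axis=1, keepdims=True)

assignments = sinkhorn_assign(features @ prototypes.T, epsilon=0.5, n_iters=50)
print(assignments.sum(axis=0))  # per-prototype mass, expected near 64 / 8
```

Because every prototype is forced to receive roughly the same share of the batch, mapping all tubes to a single cluster is ruled out, which is the collapse that a plain L2 loss between two jointly trained networks would permit.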

Paper Structure

This paper contains 42 sections, 5 equations, 5 figures, and 13 tables.

Figures (5)

  • Figure 1: Overview of our idea. Compared to VideoMAE, which uses RGB pixels as targets, we generate Sinkhorn-regularised features as reconstruction targets. This obtains more semantic features and yields better pretraining performance.
  • Figure 2: Overview of our proposed method, SIGMA. A given video is embedded with the projection network $\varphi$, leading to features $\mathbf{x}^{\varphi}$. The video model $\Psi$ predicts feature embeddings $\mathbf{x}^\Psi$ of the masked space-time tubes. Both embeddings are projected onto the learnable prototypes representing cluster centroids. Cluster assignments are created with an adapted Sinkhorn algorithm enforcing equipartition across all prototypes. These pseudo-labels are then used as targets for the predictive task $\mathcal{L}_{CE}$ with which the networks are optimized.
  • Figure 3: Benchmark II: Unsupervised video object segmentation results on DAVIS. We visualize the abilities of masked video modeling methods to produce temporally consistent semantic segmentation masks. SIGMA provides more coherent and consistent object cluster maps compared to other methods. This shows that our learned features have better temporal and spatial understanding.
  • Figure 4: Visualization of prototypes. We visualize the 25 space-time tubes with the highest similarity to a particular prototype inside a video. For simplicity, we visualize the first patch inside the space-time tube. We observe that different prototypes attend to particular semantic parts of the video; for example, prototype 1 corresponds to the blue parts of the car.
  • Figure 5: Visualization of prototypes (2). We visualize the 25 space-time tubes with the highest similarity to a particular prototype inside a video. For simplicity, we visualize the first patch inside the space-time tube. We observe that different prototypes attend to particular semantic parts of the video, for example, prototype 1 corresponds to the person(s) in white.
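
The symmetric prediction task described in the Figure 2 caption can be sketched as a swapped cross-entropy: the scores of the video model $\Psi$ are trained against the cluster assignments of the projection network $\varphi$, and vice versa. The NumPy sketch below is a hedged illustration only. The function names, the `temperature` parameter, the averaging of the two terms, and the one-hot stand-in targets are assumptions, not details taken from the paper.

```python
import numpy as np

def log_softmax(logits):
    # Row-wise log-softmax, shifted by the row max for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def cross_entropy(targets, logits, temperature=0.1):
    # Mean cross-entropy between target distributions and softmax(logits / T).
    return float(-(targets * log_softmax(logits / temperature)).sum(axis=1).mean())

def symmetric_prediction_loss(scores_psi, scores_phi, q_psi, q_phi, temperature=0.1):
    """Swapped prediction: each network is scored against the other's targets.

    scores_*: (N, K) prototype scores; q_*: (N, K) cluster assignments,
    treated as fixed targets. Averaging the two terms is an assumption.
    """
    loss_psi = cross_entropy(q_phi, scores_psi, temperature)  # Psi predicts phi
    loss_phi = cross_entropy(q_psi, scores_phi, temperature)  # phi predicts Psi
    return 0.5 * (loss_psi + loss_phi)

# Hypothetical stand-ins: random scores and one-hot assignments as targets.
rng = np.random.default_rng(0)
scores_psi = rng.normal(size=(32, 8))
scores_phi = rng.normal(size=(32, 8))
q_psi = np.eye(8)[scores_psi.argmax(axis=1)]
q_phi = np.eye(8)[scores_phi.argmax(axis=1)]

loss = symmetric_prediction_loss(scores_psi, scores_phi, q_psi, q_phi)
print(loss)
```

In a training setup the targets would come from the balanced Sinkhorn assignments rather than from an argmax, and gradients would not flow through them; the one-hot targets here only keep the example short.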