FILS: Self-Supervised Video Feature Prediction In Semantic Language Space

Mona Ahmadian, Frank Guerin, Andrew Gilbert

TL;DR

FILS introduces a fully self-supervised framework that learns semantic video representations by predicting masked features in a language space while aligning video patches with text via ActCLIP. The method combines a teacher-student EMA architecture, tube-based masking, and patch-wise language-guided contrastive learning, yielding state-of-the-art transfer to egocentric action recognition tasks with efficient training. Key contributions include feature prediction in language space, ActCLIP for action-area patch-text alignment, and comprehensive ablations and qualitative analyses that show improved semantic grounding over pixel-based approaches. The approach has strong practical impact for robust, scalable video understanding, particularly in data-constrained or domain-specific settings, with potential gains from larger-scale data and models.
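Two of the ingredients named above, tube-based masking and the EMA teacher update, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `tube_mask` and `ema_update`, the 0.9 mask ratio, the 8x196 token grid, and the 0.99 momentum are all assumptions chosen for the example, since this summary does not state the actual values.

```python
import random


def tube_mask(num_frames, patches_per_frame, mask_ratio, seed=None):
    """Build a tube mask: the same spatial positions are masked in every frame.

    Returns a list of per-frame boolean lists (True = masked).
    """
    rng = random.Random(seed)
    num_masked = int(round(mask_ratio * patches_per_frame))
    masked_positions = set(rng.sample(range(patches_per_frame), num_masked))
    frame_mask = [p in masked_positions for p in range(patches_per_frame)]
    # Repeating one spatial mask over time makes each masked patch a "tube".
    return [list(frame_mask) for _ in range(num_frames)]


def ema_update(teacher, student, momentum):
    """Exponential moving average: teacher <- m * teacher + (1 - m) * student."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


# Hypothetical sizes: 8 temporal slices, 14x14 = 196 patches per slice.
mask = tube_mask(num_frames=8, patches_per_frame=196, mask_ratio=0.9, seed=0)

# Toy "weights" as flat lists; a real model would update every parameter tensor.
teacher = ema_update(teacher=[1.0, 0.0], student=[0.0, 1.0], momentum=0.99)
```

Because the teacher is only ever updated through the moving average, it changes slowly and can supply stable prediction targets for the student on the masked tubes.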

Abstract

This paper presents a self-supervised approach for learning semantic video representations. Recent vision studies show that masking strategies and natural language supervision have both contributed to transferable visual pretraining. Our goal is to achieve a more semantic video representation by leveraging text related to the video content during pretraining in a fully self-supervised manner. To this end, we present FILS, a novel self-supervised method for video Feature prediction In semantic Language Space. By correctly predicting the semantics of masked features in language space, the vision model can capture valuable structured information. The model is learned using a patch-wise video-text contrastive strategy in which the text representations act as prototypes for transforming vision features into a language space; the transformed features are then used as targets for semantically meaningful feature prediction with our masked encoder-decoder structure. FILS demonstrates remarkable transferability on downstream action recognition tasks, achieving state-of-the-art results on challenging egocentric datasets such as Epic-Kitchens, Something-SomethingV2, Charades-Ego, and EGTEA, using ViT-Base. Our efficient method requires less computation and smaller batches than previous works.
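One way to read "text representations act as prototypes" is that each patch feature is re-expressed as a distribution of similarities over a set of text embeddings, and the student is trained to match the teacher's distribution at masked positions. The sketch below illustrates that reading only. The cosine-plus-softmax projection, the 0.1 temperature, the mean-squared-error loss, and the names `to_language_space` and `prediction_loss` are assumptions made for this example; the abstract does not specify the paper's exact projection or loss.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def to_language_space(patch_feature, text_prototypes, temperature=0.1):
    """Describe a patch by its similarity to each text prototype (assumed form)."""
    sims = [cosine(patch_feature, t) / temperature for t in text_prototypes]
    return softmax(sims)


def prediction_loss(student_feats, teacher_feats, masked, text_prototypes):
    """MSE between student and teacher language-space vectors, masked patches only."""
    total, count = 0.0, 0
    for s, t, is_masked in zip(student_feats, teacher_feats, masked):
        if not is_masked:
            continue  # visible patches do not contribute to the prediction loss
        ps = to_language_space(s, text_prototypes)
        pt = to_language_space(t, text_prototypes)
        total += sum((a - b) ** 2 for a, b in zip(ps, pt)) / len(ps)
        count += 1
    return total / count if count else 0.0


# Toy data: two text prototypes and two patches, the second of which is masked.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
teacher_feats = [[1.0, 0.0], [0.0, 1.0]]
student_feats = [[1.0, 0.0], [1.0, 0.0]]  # student is wrong on the masked patch
masked = [False, True]

loss_wrong = prediction_loss(student_feats, teacher_feats, masked, prototypes)
loss_right = prediction_loss(teacher_feats, teacher_feats, masked, prototypes)
```

In this toy setup the loss is zero when the student reproduces the teacher's features and large when the masked patch is assigned to the wrong prototype, which is the behaviour a language-space prediction target is meant to reward.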

Paper Structure

This paper contains 17 sections, 9 equations, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Architecture comparisons between MAE, CLIP, MAE+CLIP, and FILS. Contra indicates video-text contrastive loss. The red arrow points to the language space, while the black ones indicate the knowledge flow in the vision space.
  • Figure 2: Overview of our method. We perform self-supervised feature prediction and video-text contrastive learning simultaneously. The red arrow denotes the features of the patches within the action area.
  • Figure 3: Attention heatmaps generated for the initial, central, and final frames from EK100, using the last transformer layer of models trained with different self-supervised strategies: FILS, our second objective (FP), and pixel-domain reconstruction (MSE) after masking.
  • Figure 4: Visualization of the similarity between text and video features for the EK100 dataset. The provided text is the action label of the video we used.
  • Figure 5: The impact of varying pretraining epochs on the Epic-Kitchens-100 dataset. There is a consistent upward trend in action recognition accuracy with an increase in the number of pretraining epochs.
  • ...and 2 more figures