FILS: Self-Supervised Video Feature Prediction In Semantic Language Space

Mona Ahmadian; Frank Guerin; Andrew Gilbert

TL;DR

FILS is a fully self-supervised framework that learns semantic video representations by predicting masked features in a language space while aligning video patches with text via ActCLIP. The method combines a teacher-student EMA architecture, tube-based masking, and patch-wise language-guided contrastive learning, and it achieves state-of-the-art transfer to egocentric action recognition tasks with efficient training. Key contributions are feature prediction in language space, ActCLIP for patch-text alignment within the action area, and ablations and qualitative analyses showing improved semantic grounding over pixel-based approaches. The approach is aimed at robust, scalable video understanding, particularly in data-constrained or domain-specific settings, and may benefit further from larger-scale data and models.
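As a rough illustration of two of the ingredients named above, the following sketch shows tube-based masking (the same spatial patches are hidden in every frame) and an EMA teacher update. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `tube_mask` and `ema_update`, the mask ratio, and the momentum value are all hypothetical choices made for the example.

```python
import random


def tube_mask(num_frames, num_patches, mask_ratio, seed=0):
    """Build a tube mask: True marks a masked patch.

    The same spatial patch indices are masked in every frame, so each
    masked location forms a "tube" through time. The ratio and seed are
    illustrative assumptions, not values taken from the paper.
    """
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    frame_mask = [p in masked for p in range(num_patches)]
    # Copy the single-frame mask across all frames.
    return [list(frame_mask) for _ in range(num_frames)]


def ema_update(teacher, student, momentum=0.999):
    """Exponential moving average: teacher <- m * teacher + (1 - m) * student.

    Parameters are plain lists of floats here; a real model would apply
    the same rule to every weight tensor.
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


mask = tube_mask(num_frames=4, num_patches=10, mask_ratio=0.6)
teacher = ema_update(teacher=[1.0, 2.0], student=[0.0, 0.0], momentum=0.9)
```

In this setup the student sees only the unmasked patches, while the slowly moving teacher provides the targets for the masked ones.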

Abstract

This paper presents a self-supervised approach for learning semantic video representations. Recent vision studies show that masking strategies and natural language supervision have both contributed to transferable visual pretraining. Our goal is to achieve a more semantic video representation by leveraging the text related to the video content during pretraining in a fully self-supervised manner. To this end, we present FILS, a novel self-supervised method for video Feature prediction In semantic Language Space. The vision model can capture valuable structured information by correctly predicting masked feature semantics in language space. It is learned using a patch-wise video-text contrastive strategy in which the text representations act as prototypes for transforming vision features into a language space; the transformed features are then used as targets for semantically meaningful feature prediction with our masked encoder-decoder structure. FILS demonstrates remarkable transferability on downstream action recognition tasks, achieving state-of-the-art results on challenging egocentric datasets such as Epic-Kitchens, Something-SomethingV2, Charades-Ego, and EGTEA, using ViT-Base. Our efficient method requires less computation and smaller batches than previous works.
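One way to picture "text representations acting as prototypes" is to express each patch feature as a distribution over text embeddings and to use the teacher's distribution as the prediction target. The sketch below assumes cosine similarity, a softmax with a temperature, and a cross-entropy loss. None of these specifics are stated in the abstract, and the names `to_language_space` and `prediction_loss` are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def to_language_space(patch_feature, text_prototypes, temperature=0.07):
    """Map a patch feature to a softmax distribution over text prototypes.

    Assumption: similarity is cosine similarity divided by a temperature.
    """
    p = l2_normalize(patch_feature)
    logits = [
        sum(a * b for a, b in zip(p, l2_normalize(t))) / temperature
        for t in text_prototypes
    ]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_loss(student_dist, teacher_dist, eps=1e-12):
    """Cross-entropy between the teacher target and the student prediction."""
    return -sum(t * math.log(s + eps) for s, t in zip(student_dist, teacher_dist))


prototypes = [[1.0, 0.0], [0.0, 1.0]]          # toy text embeddings
target = to_language_space([1.0, 0.1], prototypes)   # teacher, full view
pred = to_language_space([0.8, 0.3], prototypes)     # student, masked view
loss = prediction_loss(pred, target)
```

The point of the sketch is only that the target lives in a space indexed by text, rather than in pixel space, so a correct prediction has to capture what the masked region means rather than how it looks.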

Paper Structure (17 sections, 9 equations, 7 figures, 5 tables)

Introduction
Related Works
Self-Supervised Video Feature Prediction in Semantic Language Space
Model Architecture
Training Objectives
Experiments
Action Recognition Task
Ablation Study
Attention Visualization
FILS learns semantic representations
Conclusion
Implementation Details
Datasets and Metrics
Comparison of FILS with Pixel Reconstruction
Charades-Ego and EGTEA Action Recognition using FILS pretrained on SSV2
...and 2 more sections

Figures (7)

Figure 1: Architecture comparisons between MAE, CLIP, MAE+CLIP, and FILS. Contra indicates video-text contrastive loss. The red arrow points to the language space, while the black ones indicate the knowledge flow in the vision space.
Figure 2: Overview of our method. We perform self-supervised feature prediction and video-text contrastive learning simultaneously. The red arrow denotes the features of the patches within the action area.
Figure 3: Attention heatmaps generated for the initial, central, and final frames of EK100 videos using the last transformer layer of the model trained with self-supervised strategies including FILS, our second objective (FP), and pixel-domain reconstruction (MSE) after masking.
Figure 4: Visualization of the similarity between text and video features for the EK100 dataset. The provided text is the action label of the video we used.
Figure 5: The impact of varying pretraining epochs on the Epic-Kitchens-100 dataset. There is a consistent upward trend in action recognition accuracy with an increase in the number of pretraining epochs.
...and 2 more figures
