SViTT-Ego: A Sparse Video-Text Transformer for Egocentric Video

Hector A. Valdez, Kyle Min, Subarna Tripathi

TL;DR

This work pretrains SViTT-Ego, the first sparse egocentric video-text transformer model integrating edge and node sparsification, on the EgoClip dataset and incorporates the egocentric-friendly objective EgoNCE instead of the frequently used InfoNCE.

Abstract

Pretraining egocentric vision-language models has become essential to improving downstream egocentric video-text tasks. These egocentric foundation models commonly use the transformer architecture, whose memory footprint during pretraining can be substantial. Therefore, we pretrain SViTT-Ego, the first sparse egocentric video-text transformer model integrating edge and node sparsification. We pretrain on the EgoClip dataset and incorporate the egocentric-friendly objective EgoNCE instead of the frequently used InfoNCE. Most notably, SViTT-Ego obtains a +2.8% gain on EgoMCQ (intra-video) accuracy compared to LAVILA large, using no data augmentation beyond standard image augmentations, while remaining pretrainable on memory-limited devices.

Paper Structure (10 sections, 1 equation, 1 figure, 4 tables)


Figures (1)

  • Figure 1: Qualitative results from the vision encoder with $q_{v}=0.7$. Row 1 shows the 4-frame input; rows 2, 3, and 4 show the video encoder's output after visual token pruning at layers 4, 7, and 10, respectively. We follow Liang et al. (2022) to prune visual tokens.
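
To make the effect of pruning at layers 4, 7, and 10 concrete, the following minimal sketch counts how many visual tokens survive each pruning step. It rests on several assumptions that are not stated in the text above: $q_{v}$ is treated as a per-step keep rate, the kept count is rounded up, each of the 4 frames contributes 14 × 14 patch tokens, the encoder has 12 layers, and the importance scores are random placeholders standing in for attention-based scores. The function names `prune_tokens` and `simulate` are hypothetical, and any special handling of dropped tokens is omitted.

```python
import math
import random


def prune_tokens(tokens, scores, keep_rate):
    """Keep the ceil(keep_rate * n) highest-scoring tokens, preserving their order."""
    n_keep = math.ceil(keep_rate * len(tokens))
    # Indices of the n_keep largest scores.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:n_keep]
    # Re-sort the indices so surviving tokens stay in their original order.
    return [tokens[i] for i in sorted(top)]


def simulate(num_tokens, keep_rate=0.7, prune_layers=(4, 7, 10), num_layers=12, seed=0):
    """Return {layer: surviving token count} for each pruning layer."""
    rng = random.Random(seed)
    tokens = list(range(num_tokens))
    counts = {}
    for layer in range(1, num_layers + 1):
        if layer in prune_layers:
            # Placeholder scores (assumption): a real model would derive
            # these from attention weights rather than random numbers.
            scores = [rng.random() for _ in tokens]
            tokens = prune_tokens(tokens, scores, keep_rate)
            counts[layer] = len(tokens)
    return counts


if __name__ == "__main__":
    # 4 frames x 14 x 14 patches = 784 visual tokens (assumed input size).
    print(simulate(4 * 14 * 14))
```

Under these assumptions, 784 input tokens shrink to 549, 385, and 270 after layers 4, 7, and 10, so roughly a third of the visual tokens remain for the final layers.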