SITAR: Semi-supervised Image Transformer for Action Recognition

Owais Iqbal, Omprakash Chakraborty, Aftab Hussain, Rameswar Panda, Abir Das

TL;DR

This paper tiles multiple frames from each input video into a row-column grid to construct super images, encodes them with a 2D image transformer, and applies a contrastive loss that minimizes the similarity between representations of different videos while maximizing the similarity between representations of the same video.

Abstract

Recognizing actions from a limited set of labeled videos remains a challenge, as annotating visual data is not only tedious but can also be expensive due to its classified nature. Moreover, handling spatio-temporal data with deep 3D transformers can introduce significant computational complexity. In this paper, our objective is to address video action recognition in a semi-supervised setting by leveraging only a handful of labeled videos along with a collection of unlabeled videos in a compute-efficient manner. Specifically, we rearrange multiple frames from the input videos in row-column form to construct super images. Subsequently, we capitalize on the vast pool of unlabeled samples and employ contrastive learning on the encoded super images. Our proposed approach employs two pathways to generate representations for temporally augmented super images originating from the same video. Specifically, we utilize a 2D image transformer to generate representations and apply a contrastive loss function to minimize the similarity between representations from different videos while maximizing the similarity between representations of the same video. Our method demonstrates superior performance compared to existing state-of-the-art approaches for semi-supervised action recognition across various benchmark datasets, all while significantly reducing computational costs.
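
The super-image construction described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name make_super_image, the row-major fill order, and the zero padding of unused grid cells are assumptions made for illustration. The three temporal orderings (normal, random, reverse) follow the caption of Figure 1.

```python
import numpy as np

def make_super_image(frames, grid, order="normal", rng=None):
    """Tile T frames of shape (H, W, C) into one (rows*H, cols*W, C) image.

    Hypothetical helper: the row-major layout and the zero padding are
    assumptions, not details confirmed by the paper summary.
    """
    frames = np.asarray(frames)
    t, h, w, c = frames.shape
    rows, cols = grid
    if rows * cols < t:
        raise ValueError("grid is too small for the number of frames")

    # Choose the temporal ordering of the frames.
    idx = np.arange(t)
    if order == "reverse":
        idx = idx[::-1]
    elif order == "random":
        if rng is None:
            rng = np.random.default_rng()
        idx = rng.permutation(t)
    elif order != "normal":
        raise ValueError(f"unknown order: {order!r}")
    frames = frames[idx]

    # Assumption: fill any unused grid cells with blank (zero) frames.
    pad = rows * cols - t
    if pad:
        blank = np.zeros((pad, h, w, c), dtype=frames.dtype)
        frames = np.concatenate([frames, blank], axis=0)

    # (rows, cols, H, W, C) -> (rows, H, cols, W, C) -> (rows*H, cols*W, C)
    tiled = frames.reshape(rows, cols, h, w, c).transpose(0, 2, 1, 3, 4)
    return tiled.reshape(rows * h, cols * w, c)
```

For example, eight 224x224 RGB frames with grid=(3, 3) would yield one 672x672 image whose last cell is blank; that image can then be fed to an ordinary 2D image transformer.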

Paper Structure (32 sections, 5 equations, 9 figures, 8 tables)

Figures (9)

  • Figure 1: Construction of Super Image. Aligning frames to form a grid using three different temporal orderings: normal, random, and reverse.
  • Figure 2: SITAR framework. The proposed framework processes the unlabeled videos through two pathways, Primary and Secondary, which use an image-transformer backbone and share the same weights. The Primary pathway is initially trained with limited labeled data. We then generate two versions of super images for the unlabeled videos, a fast one with more frames and a slow one with fewer frames, and pass them through the Primary and Secondary pathways respectively. The training objective is to maximize the agreement between the output predictions of the two pathways. To achieve this, we employ two types of contrastive losses: first, an instance contrastive loss to align the representations of a given unlabeled super image across both pathways; second, a group contrastive loss to align the average representations of unlabeled super images grouped using pseudo-labels. During inference, only the Primary pathway is used to identify actions. (Best viewed in color.)
  • Figure 3: Instance contrastive loss vs. group contrastive loss. SITAR employs two different contrastive losses to leverage the unlabeled super images. The instance contrastive loss maximizes the agreement between two instances of the same video while minimizing the agreement with the other videos in a given mini-batch. This risks inadvertently pushing apart samples of the same action within the mini-batch (right). To mitigate this, we employ a group contrastive loss, which first groups videos with the same activity class (left) as predicted by high-confidence pseudo-labels. The average representation is then obtained for each group, and the contrastive learning policy is applied at the group level. (Best viewed in color.)
  • Figure 4: Effect of hyperparameters on HMDB51. (Left) Varying the ratio of unlabeled data to labeled data ($\mu$). (Right) Varying the instance-contrastive loss weight ($\gamma$).
  • Figure 5: Effect of hyperparameters on HMDB51. Varying the group-contrastive loss weight ($\beta$).
  • ...and 4 more figures
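
The two contrastive losses described in the captions of Figures 2 and 3 can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's exact formulation: it uses an InfoNCE-style loss with cosine similarity, a hypothetical temperature tau, negatives drawn only from the other pathway, and a single array of pseudo-labels shared by both pathways that is assumed to have been filtered for high confidence beforehand. All function names are hypothetical.

```python
import numpy as np

def _l2_normalize(x, eps=1e-8):
    # Unit-normalize each row so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def info_nce(a, b, tau=0.5):
    """InfoNCE-style loss: row i of `a` and row i of `b` form the positive
    pair; every other row of `b` acts as a negative (assumed form)."""
    a, b = _l2_normalize(a), _l2_normalize(b)
    logits = a @ b.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def instance_contrastive(primary, secondary, tau=0.5):
    # Align the two pathway outputs for the same unlabeled super image.
    return info_nce(primary, secondary, tau)

def group_means(feats, pseudo_labels, num_classes):
    # Average the representations that share a pseudo-label.
    means = np.zeros((num_classes, feats.shape[1]))
    present = np.zeros(num_classes, dtype=bool)
    for k in range(num_classes):
        mask = pseudo_labels == k
        if mask.any():
            means[k] = feats[mask].mean(axis=0)
            present[k] = True
    return means, present

def group_contrastive(primary, secondary, pseudo_labels, num_classes, tau=0.5):
    # Apply the same contrastive objective to per-class average
    # representations, so same-action samples are not pushed apart.
    mp, present = group_means(primary, pseudo_labels, num_classes)
    ms, _ = group_means(secondary, pseudo_labels, num_classes)
    if not present.any():
        return 0.0
    return info_nce(mp[present], ms[present], tau)
```

In training, the two terms would presumably be added to the supervised loss with the weights $\gamma$ (instance) and $\beta$ (group) named in Figures 4 and 5; the exact combination is not given in this summary.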