SoGAR: Self-supervised Spatiotemporal Attention-based Social Group Activity Recognition

Naga VS Raviteja Chappa, Pha Nguyen, Alexander H Nelson, Han-Seok Seo, Xin Li, Page Daniel Dobbs, Khoa Luu

TL;DR

This paper introduces SoGAR, a novel approach to social group activity recognition. It uses a self-supervised transformer network that can effectively utilize unlabeled video data, and its transformer-based encoders alleviate the limitations of the weakly supervised setting of group activity recognition.

Abstract

This paper introduces SoGAR, a novel approach to social group activity recognition that uses a self-supervised transformer network to effectively utilize unlabeled video data. To extract spatio-temporal information, we create local and global views with varying frame rates. Our self-supervised objective ensures that features extracted from contrasting views of the same video are consistent across spatio-temporal domains. The proposed approach uses transformer-based encoders efficiently to alleviate the limitations of the weakly supervised setting of group activity recognition. By leveraging the benefits of transformer models, our approach can model long-term relationships along spatio-temporal dimensions. SoGAR achieves state-of-the-art results on three group activity recognition benchmarks (JRDB-PAR, NBA, and Volleyball), surpassing previous results in terms of F1-score, MCA, and MPCA.
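
The abstract mentions local and global views with varying frame rates but does not spell out the sampling procedure. The following is a minimal sketch of one way such views could be drawn, not the authors' implementation. The function names, the default clip lengths, the number of local views, and the `local_fraction` window size are all hypothetical. The only constraint taken from the paper is that a local view has no more frames than a global view.

```python
import random


def sample_clip_indices(num_frames, clip_len, span, rng):
    """Pick `clip_len` evenly spaced frame indices from a random window of `span` frames."""
    span = min(span, num_frames)
    start = rng.randint(0, num_frames - span)  # inclusive on both ends
    if clip_len == 1:
        return [start]
    # Integer arithmetic keeps the first index at `start` and the last at `start + span - 1`.
    return [start + (i * (span - 1)) // (clip_len - 1) for i in range(clip_len)]


def make_views(num_frames, k_global=8, k_local=4, n_local=4, local_fraction=0.25, seed=0):
    """Return one global view and several local views as lists of frame indices.

    Assumption: the global view spans the whole video, while each local view
    covers a short random window, so the two are sampled at different rates.
    """
    if k_local > k_global:
        raise ValueError("local views should not have more frames than global views")
    rng = random.Random(seed)
    global_view = sample_clip_indices(num_frames, k_global, num_frames, rng)
    local_span = max(k_local, int(num_frames * local_fraction))
    local_views = [
        sample_clip_indices(num_frames, k_local, local_span, rng) for _ in range(n_local)
    ]
    return global_view, local_views


global_view, local_views = make_views(num_frames=64)
```

With these hypothetical defaults, the global view strides across all 64 frames while each local view covers a 16-frame window, so the same video is seen at two different effective frame rates.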

Paper Structure

This paper contains 17 sections, 4 equations, 7 figures, and 7 tables.

Figures (7)

  • Figure 1: Overview of conventional and proposed methods for social activity recognition. The labels in the right image show the predicted labels.
  • Figure 2: Comparison of Actor Relational Learning (ARL) Modules
  • Figure 3: The proposed SoGAR framework adopts a sampling strategy that divides the input video into global and local views in the temporal and spatial domains. Since the video clips are sampled at different rates, the global and local views have distinct spatial characteristics and limited fields of view and are subject to spatial augmentations. The teacher network takes in global views ($\bm{x}_{g_t}$) to generate a target, while the student network processes local views ($\bm{x}_{l_t}$ and $\bm{x}_{l_s}$), where $K_l \le K_g$. We update the network weights by matching the student local views to the target teacher global views, which involves both Temporal Collaborative Learning and Spatio-temporal Cooperative Learning. To accomplish this, we employ a standard ViT-Base backbone with separate space-time attention (Bertasius et al., ICML 2021) and an MLP that predicts target features from student features.
  • Figure 4: Video Transformer Block
  • Figure 5: Inference. We input the video sequence along with its corresponding labels. The output from the model is fed to the downstream task classifier.
  • ...and 2 more figures
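
The Figure 3 caption describes matching student outputs on local views to a teacher target computed from a global view, but this section does not give the loss itself. The sketch below assumes a temperature-scaled cross-entropy between softmax distributions, a common choice in teacher-student self-supervised training. That choice, the temperature values, and all function names are assumptions, not details confirmed by the paper.

```python
import math


def softmax(logits, temperature):
    """Numerically stable softmax with temperature scaling."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def matching_loss(teacher_logits, student_logits, t_teacher=0.04, t_student=0.1):
    """Cross-entropy between the teacher target and the student prediction.

    Assumption: the teacher uses a lower temperature so its target is sharper.
    """
    target = softmax(teacher_logits, t_teacher)
    pred = softmax(student_logits, t_student)
    return -sum(p * math.log(q + 1e-12) for p, q in zip(target, pred))


def view_matching_loss(teacher_global, student_locals):
    """Average the matching loss over all student local views for one teacher global view."""
    losses = [matching_loss(teacher_global, s) for s in student_locals]
    return sum(losses) / len(losses)


# Toy example: one teacher output (global view) and two student outputs (local views).
teacher_out = [2.0, 0.0, 0.0]
agreeing = view_matching_loss(teacher_out, [[2.0, 0.0, 0.0], [1.5, 0.0, 0.0]])
disagreeing = view_matching_loss(teacher_out, [[0.0, 2.0, 0.0], [0.0, 0.0, 1.5]])
```

In this toy setup the loss is small when the student outputs on local views agree with the teacher target and large when they do not, which is the consistency pressure the caption describes.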