SoGAR: Self-supervised Spatiotemporal Attention-based Social Group Activity Recognition

Naga VS Raviteja Chappa; Pha Nguyen; Alexander H Nelson; Han-Seok Seo; Xin Li; Page Daniel Dobbs; Khoa Luu

Abstract

This paper introduces a novel approach to Social Group Activity Recognition (SoGAR) using a self-supervised Transformer network that can effectively utilize unlabeled video data. To extract spatio-temporal information, we create local and global views with varying frame rates. Our self-supervised objective ensures that features extracted from contrasting views of the same video are consistent across spatio-temporal domains. The approach uses transformer-based encoders to alleviate the weakly supervised setting of group activity recognition, and by leveraging transformer models it can model long-term relationships along spatio-temporal dimensions. SoGAR achieves state-of-the-art results on three group activity recognition benchmarks, namely the JRDB-PAR, NBA, and Volleyball datasets, surpassing prior results in terms of F1-score, MCA, and MPCA.
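
The view construction described in the abstract can be illustrated with a minimal sketch. It assumes that a global view covers the whole clip and a local view covers a shorter random window with no more frames than the global view. The function name `sample_view_indices`, the `span_fraction` parameter, and all frame counts are hypothetical choices for illustration, not values taken from the paper.

```python
import random

def sample_view_indices(num_frames, num_out, span_fraction, rng):
    """Pick `num_out` evenly spaced frame indices from a random temporal window.

    `span_fraction` (hypothetical knob) is the share of the video the window covers.
    """
    assert 0 < num_out <= num_frames
    # Window length in frames: at least num_out, at most the whole video.
    span = min(num_frames, max(num_out, int(num_frames * span_fraction)))
    start = rng.randint(0, num_frames - span)
    step = span / num_out  # effective frame-rate stride of this view
    return [start + int(i * step) for i in range(num_out)]

rng = random.Random(0)
num_frames = 72  # illustrative clip length

# Global view: spans the full clip with the largest frame count.
global_view = sample_view_indices(num_frames, num_out=18, span_fraction=1.0, rng=rng)

# Local views: shorter windows with fewer (or equal) frames, so each one
# sees the video at a different effective frame rate.
local_views = [
    sample_view_indices(num_frames, num_out=k, span_fraction=0.3, rng=rng)
    for k in (2, 4, 8, 16)
]
```

Because each view keeps a different number of frames from a window of different length, the stride between sampled frames differs from view to view, which is one simple way to obtain "varying frame rates" from a single video.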

Paper Structure (17 sections, 4 equations, 7 figures, 7 tables)

Introduction
Related Work
Group Activity Recognition (GAR)
The Proposed Method
Self-Supervised Training
Prediction of motion via Self-Supervised Learning
Establishing Correspondences Across Different Views
The Proposed Objective Function
Inference
Experiments
Datasets
Deep Network Architecture
Implementation Details
Comparison with state-of-the-art methods
Ablation Study
...and 2 more sections

Figures (7)

Figure 1: Overview of conventional and proposed methods for social activity recognition. The labels in the right image show the predicted labels.
Figure 2: Comparison of Actor Relational Learning (ARL) Modules
Figure 3: The proposed SoGAR framework adopts a sampling strategy that divides the input video into global and local views in the temporal and spatial domains. Since the video clips are sampled at different rates, the global and local views have distinct spatial characteristics and limited fields of view, and are subject to spatial augmentations. The teacher network takes in global views ($\bm{x}_{g_t}$) to generate a target, while the student network processes local views ($\bm{x}_{l_t}$ and $\bm{x}_{l_s}$), where $K_l \le K_g$. We update the network weights by matching the student local views to the target teacher global views, which involves both Temporal Collaborative Learning and Spatio-temporal Cooperative Learning. To accomplish this, we employ a standard ViT-Base backbone with separate space-time attention [gberta_2021_ICML] and an MLP that predicts target features from student features.
Figure 4: Video Transformer Block
Figure 5: Inference. We input the video sequence along with its corresponding labels. The output from the model is fed to the downstream task classifier.
...and 2 more figures
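
The Figure 3 caption describes matching student outputs on local views to a teacher target computed from a global view. The sketch below shows one way such a matching loss could look. It assumes a DINO-style cross-entropy between temperature-scaled softmax distributions, which this page does not specify. The function names and temperature values are hypothetical, and the loss is averaged over local views purely for illustration.

```python
import math

def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(v - m) for v in scaled))
    return [v - log_total for v in scaled]

def view_matching_loss(teacher_logits, student_logits, t_teacher=0.04, t_student=0.1):
    """Cross-entropy H(p_teacher, p_student) for one global/local view pair.

    Temperatures are assumed values; the teacher side is treated as a fixed target.
    """
    p_teacher = [math.exp(v) for v in log_softmax(teacher_logits, t_teacher)]
    log_p_student = log_softmax(student_logits, t_student)
    return -sum(pt * lps for pt, lps in zip(p_teacher, log_p_student))

def total_loss(teacher_global, student_locals):
    """Average the matching loss over all student local views."""
    losses = [view_matching_loss(teacher_global, s) for s in student_locals]
    return sum(losses) / len(losses)

teacher_out = [2.0, 0.5, -1.0]    # teacher output for a global view (toy values)
agreeing = [2.0, 0.5, -1.0]       # student output that agrees with the teacher
disagreeing = [-1.0, 0.5, 2.0]    # student output that disagrees

loss_agree = total_loss(teacher_out, [agreeing])
loss_disagree = total_loss(teacher_out, [disagreeing])
```

In this toy setup the loss is small when the student's prediction on a local view agrees with the teacher's target and grows when it does not, which is the behavior a view-consistency objective needs.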
