REACT: Recognize Every Action Everywhere All At Once

Naga VS Raviteja Chappa, Pha Nguyen, Page Daniel Dobbs, Khoa Luu

TL;DR

REACT (Recognize Every Action Everywhere All At Once) is a novel architecture that models complex contextual relationships within videos, offering a robust framework for nuanced scene comprehension.

Abstract

Group Activity Recognition (GAR) is a fundamental problem in computer vision, with diverse applications in sports video analysis, video surveillance, and social scene understanding. Unlike conventional action recognition, GAR aims to classify the actions of a group of individuals as a whole, which requires a deep understanding of their interactions and spatiotemporal relationships. To address the challenges in GAR, we present REACT (Recognize Every Action Everywhere All At Once), a novel architecture inspired by the transformer encoder-decoder model and explicitly designed to model complex contextual relationships within videos, including multi-modal and spatiotemporal features. Our architecture features a Vision-Language Encoder block for integrated temporal, spatial, and multi-modal interaction modeling. This component efficiently encodes spatiotemporal interactions, even with sparsely sampled frames, and recovers essential local information. Our Action Decoder Block refines the joint understanding of text and video data, allowing us to precisely retrieve bounding boxes and strengthening the link between semantics and visual reality. At the core, our Actor Fusion Block fuses actor-specific data with textual features, striking a balance between specificity and context. Our method outperforms state-of-the-art GAR approaches in extensive experiments, demonstrating superior accuracy in recognizing and understanding group activities. These experiments provide empirical evidence of its performance gains, and the architecture's potential extends to diverse real-world applications. This work advances the field of group activity recognition, providing a robust framework for nuanced scene comprehension.

Paper Structure (21 sections, 4 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: An example of the REACT model's response to a user's input query. The user provides a video sequence and an action prompt. The model then localizes the actors performing each requested action in the scene and predicts the overall group activity. Best viewed in color and zoomed in.
  • Figure 2: Comparison between prior methods and our approach. Prior methods perform a single classification or detection task under full supervision, whereas our approach performs group activity classification and query-based action detection simultaneously. Best viewed in color and zoomed in.
  • Figure 3: Overall architecture of the proposed REACT network. The visual and textual representation learning components of our approach incorporate multi-level feature representations. The extracted features are passed through the contextual relationship modeling block to obtain concatenated multi-modal features, which are then passed through the prompt action retrieval block to obtain the bounding boxes detected for the prompt.
  • Figure 4: Visualization based on the input action query. The top two rows show results from the JRDB-PAR dataset, and the bottom row shows results from the Volleyball dataset. Best viewed in color and zoomed in.
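
The abstract describes the Actor Fusion Block only at a high level, as a fusion of actor-specific data and textual features. As a rough illustration of what such a fusion could look like, the sketch below lets each actor feature vector attend over the text token features with scaled dot-product attention and then concatenates the actor vector with the resulting text context. This is an assumption made for illustration, not the authors' implementation: the function names `softmax` and `fuse_actor_with_text`, the absence of learned projections, and the use of concatenation are all hypothetical choices.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_actor_with_text(actor_feats, text_feats):
    """Hypothetical fusion: attend from each actor to the text tokens.

    actor_feats: list of actor feature vectors, each of length d
    text_feats:  list of text token feature vectors, each of length d
    Returns one vector of length 2 * d per actor: the actor feature
    followed by its attention-weighted text context.
    """
    dim = len(text_feats[0])
    fused = []
    for actor in actor_feats:
        # Scaled dot-product score between this actor and every text token.
        scores = [
            sum(a * t for a, t in zip(actor, token)) / math.sqrt(dim)
            for token in text_feats
        ]
        weights = softmax(scores)
        # Text context is the attention-weighted average of the token vectors.
        context = [
            sum(w * token[i] for w, token in zip(weights, text_feats))
            for i in range(dim)
        ]
        # Concatenation keeps the actor-specific part next to the text context.
        fused.append(list(actor) + context)
    return fused


if __name__ == "__main__":
    actors = [[1.0, 0.0], [0.0, 1.0]]
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for row in fuse_actor_with_text(actors, tokens):
        print([round(x, 3) for x in row])
```

In this toy setup the first half of each output vector is the unchanged actor feature (the "specificity" part) and the second half is a text-derived context (the "context" part). A real model would add learned query, key, and value projections and operate on batched tensors.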