Unified Framework with Consistency across Modalities for Human Activity Recognition

Tuyen Tran, Thao Minh Le, Hung Tran, Truyen Tran

TL;DR

The paper tackles robust human activity recognition in videos by fusing RGB and skeleton modalities within a unified, modality-agnostic framework. It introduces COMPUTER, a compositional human-centric query machine built from stacked HUB blocks that model human-human and human-context interactions across past, present, and future video segments. A cross-modality consistency loss, implemented as a contrastive objective, aligns the representations that different modalities produce for the same actor; this alignment needs no extra labels and is trained jointly with the cross-entropy loss for label prediction. Experiments on spatio-temporal action localization (AVA v2.2) and group activity recognition (Collective Activity) show consistent gains over strong baselines and prior state-of-the-art methods. Code is available at https://github.com/tranxuantuyen/COMPUTER.

Abstract

Recognizing human activities in videos is challenging due to the spatio-temporal complexity and context-dependence of human interactions. Prior studies often rely on single input modalities, such as RGB or skeletal data, limiting their ability to exploit the complementary advantages across modalities. Recent studies focus on combining these two modalities using simple feature fusion techniques. However, due to the inherent disparities in representation between these input modalities, designing a unified neural network architecture to effectively leverage their complementary information remains a significant challenge. To address this, we propose a comprehensive multimodal framework for robust video-based human activity recognition. Our key contribution is the introduction of a novel compositional query machine, called COMPUTER ($\textbf{COMP}ositional h\textbf{U}man-cen\textbf{T}ric qu\textbf{ER}y$ machine), a generic neural architecture that models the interactions between a human of interest and its surroundings in both space and time. Thanks to its versatile design, COMPUTER can be leveraged to distill distinctive representations for various input modalities. Additionally, we introduce a consistency loss that enforces agreement in prediction between modalities, exploiting the complementary information from multimodal inputs for robust human movement recognition. Through extensive experiments on action localization and group activity recognition tasks, our approach demonstrates superior performance when compared with state-of-the-art methods. Our code is available at: https://github.com/tranxuantuyen/COMPUTER.
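The training objective described above (label prediction plus a cross-modality consistency term) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the InfoNCE form of the contrastive term, the temperature, the loss weight, and the single-label softmax cross-entropy are all assumptions made for illustration, and plain Python lists stand in for tensors.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (zero vectors are left unchanged).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def log_softmax(scores):
    # Numerically stable log-softmax over a list of scores.
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]

def cross_entropy(logits, labels):
    # Mean negative log-likelihood of the true class, one row per actor.
    # Assumption: single-label classification.
    nll = [-log_softmax(row)[y] for row, y in zip(logits, labels)]
    return sum(nll) / len(nll)

def consistency_loss(rgb, skeleton, temperature=0.1):
    # InfoNCE-style contrastive term (an assumed form): actor i's RGB
    # embedding should be closest to actor i's skeleton embedding
    # (positive pair) and far from other actors' embeddings (negatives).
    rgb = [l2_normalize(v) for v in rgb]
    skeleton = [l2_normalize(v) for v in skeleton]
    total = 0.0
    for i, r in enumerate(rgb):
        sims = [sum(a * b for a, b in zip(r, s)) / temperature
                for s in skeleton]
        total -= log_softmax(sims)[i]
    return total / len(rgb)

def total_loss(logits_rgb, logits_skel, labels, rgb, skeleton, weight=1.0):
    # Label prediction on both modalities plus the weighted consistency term.
    # `weight` is a hypothetical hyperparameter.
    ce = cross_entropy(logits_rgb, labels) + cross_entropy(logits_skel, labels)
    return ce + weight * consistency_loss(rgb, skeleton)
```

In this sketch the consistency term is small when each actor's two embeddings agree and large when they are mismatched, which is the behavior the consistency loss is meant to encourage.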

Paper Structure

This paper contains 12 sections, 6 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Method overview: We use a unified network architecture $\text{COMPUTER}$ to extract high-level representations of human activity from multi-modal inputs, including RGB sequences and body key points. The entire framework is trained end-to-end using a combination of two loss functions: cross-entropy loss for label prediction and contrastive loss for consistency between modalities. Notably, our consistency loss maximizes the mutual information between different input modalities for the same activity in an unsupervised manner.
  • Figure 2: $\text{COMPUTER}$ models the human-human and human-context interactions in videos using a stack of $\text{HUB}$ blocks. Each $\text{HUB}$ takes as input a human-centric query $q_{i}$ (green circle) from any input modality and a knowledge base, and iteratively refines its knowledge about the human of interest. The knowledge base consists of spatio-temporal features extracted from past (blue circles), current (red circles) and future (blue circles) video segments. Depending on whether the knowledge base holds human-centric features or general contextual information, the $\text{HUB}$ block can flexibly model the relationships among humans or between humans and the surrounding entities. Best viewed in color.
  • Figure 3: Quantitative and qualitative analysis of the proposed approach on the AVA dataset.
  • Figure : Ablation on the effectiveness of each modality
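
The query refinement described in the Figure 2 caption can be sketched as follows. The captions do not give the internals of a HUB block, so this sketch substitutes plain scaled dot-product attention with a residual update and no learned projections; the names `hub_block` and `computer` and the `num_blocks` parameter are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def hub_block(query, knowledge_base):
    # One refinement step (assumed form): the human-centric query attends
    # to every feature in the knowledge base, and the attention-weighted
    # summary is added back to the query as a residual update.
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, feat)) / scale
              for feat in knowledge_base]
    weights = softmax(scores)
    attended = [sum(w * feat[d] for w, feat in zip(weights, knowledge_base))
                for d in range(len(query))]
    return [q + a for q, a in zip(query, attended)]

def computer(query, past, present, future, num_blocks=2):
    # Stack of blocks over features from past, current and future segments.
    knowledge_base = past + present + future
    for _ in range(num_blocks):
        query = hub_block(query, knowledge_base)
    return query

# Usage: a 2-d query refined against one feature per video segment.
refined = computer([1.0, 0.0], [[0.5, 0.5]], [[1.0, 0.0]], [[0.0, 1.0]])
```

Passing human-centric features as the knowledge base would correspond to modeling human-human interactions, while passing general scene features would correspond to human-context interactions, following the distinction drawn in the caption.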