Unified Framework with Consistency across Modalities for Human Activity Recognition
Tuyen Tran, Thao Minh Le, Hung Tran, Truyen Tran
TL;DR
The paper addresses robust human activity recognition in videos by fusing RGB and skeleton modalities within a unified, modality-agnostic framework. It introduces COMPUTER, a modular compositional human-centric query machine built from HUB blocks that model human-human and human-context interactions across past, present, and future contexts. A cross-modality consistency loss, implemented as a contrastive objective, aligns the representations that different modalities produce for the same actor; this alignment signal requires no additional labels and is optimized alongside label prediction. Experiments on Spatio-Temporal Action Localization (AVA v2.2) and Group Activity Recognition (Collective Activity) show consistent gains over strong baselines and prior state-of-the-art methods, supporting both the overall approach and its individual components. The framework is designed to extend to further modalities, and the code is available at https://github.com/tranxuantuyen/COMPUTER.
Abstract
Recognizing human activities in videos is challenging due to the spatio-temporal complexity and context-dependence of human interactions. Prior studies often rely on a single input modality, such as RGB or skeletal data, which limits their ability to exploit the complementary strengths of the two. More recent studies combine these modalities using simple feature fusion techniques. However, because the two modalities differ inherently in how they represent a scene, designing a unified neural network architecture that effectively leverages their complementary information remains a significant challenge. To address this, we propose a comprehensive multimodal framework for robust video-based human activity recognition. Our key contribution is a novel compositional query machine, called COMPUTER ($\textbf{COMP}ositional h\textbf{U}man-cen\textbf{T}ric qu\textbf{ER}y$ machine), a generic neural architecture that models the interactions between a human of interest and their surroundings in both space and time. Thanks to its versatile design, COMPUTER can be used to distill distinctive representations from various input modalities. Additionally, we introduce a consistency loss that enforces agreement in prediction between modalities, exploiting the complementary information in multimodal inputs for robust human movement recognition. In extensive experiments on action localization and group activity recognition tasks, our approach outperforms state-of-the-art methods. Our code is available at: https://github.com/tranxuantuyen/COMPUTER.
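
To make the idea of a cross-modality consistency objective concrete, the following is a minimal sketch of one common contrastive formulation, a symmetric InfoNCE loss over per-actor embeddings. It is an illustration under stated assumptions, not the loss used in the paper: the text above does not give the exact formulation, and the function names (`info_nce`, `cross_modal_consistency_loss`), the temperature value, and the toy embeddings are all hypothetical. The sketch assumes that row i of the RGB embeddings and row i of the skeleton embeddings describe the same actor, and that all other rows act as negatives.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean cross-entropy in which anchors[i] should match positives[i].

    Similarities are cosine similarities divided by the temperature;
    the other rows of `positives` serve as negatives for anchors[i].
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def cross_modal_consistency_loss(rgb, skeleton, temperature=0.1):
    """Symmetric contrastive loss: RGB-to-skeleton and skeleton-to-RGB."""
    return 0.5 * (
        info_nce(rgb, skeleton, temperature)
        + info_nce(skeleton, rgb, temperature)
    )


# Toy example with two actors and 2-D embeddings (hypothetical values).
rgb_emb = [[1.0, 0.0], [0.0, 1.0]]
skeleton_emb = [[0.9, 0.1], [0.1, 0.9]]
loss = cross_modal_consistency_loss(rgb_emb, skeleton_emb)
```

In a training setup of this kind, such a term would typically be added to the supervised classification loss with a weighting coefficient, so that the two modality branches are pulled toward agreement on each actor while each branch still learns to predict labels.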
