A Comprehensive Methodological Survey of Human Activity Recognition Across Diverse Data Modalities
Jungpil Shin, Najmul Hassan, Abu Saleh Musa Miah, Satoshi Nishimura
TL;DR
The paper provides a comprehensive methodological survey of Human Activity Recognition across four data modalities (RGB, skeleton, sensor, and multimodal fusion) over 2014–2024, comparing handcrafted feature approaches with end-to-end deep learning and detailing datasets, architectures, and benchmark results. It emphasizes modality-specific challenges, fusion strategies, and the evolution from CNN/RNN-based models to graph-based and transformer-inspired methods, offering dataset-centric insights and practical guidance for researchers and practitioners. Key contributions include a modality-spanning taxonomy, curated dataset descriptions with performance benchmarks, identification of gaps in cross-modality fusion, and future directions such as data augmentation, large-scale datasets, semi-supervised learning, and efficient architectures. The survey highlights the practical impact of HAR in surveillance, healthcare, and human-computer interaction, while outlining concrete research avenues to improve generalization, efficiency, and real-world deployability across diverse environments.
Abstract
Human Activity Recognition (HAR) systems aim to understand human behaviour and assign a label to each action, attracting significant attention in computer vision due to their wide range of applications. HAR can leverage various data modalities, such as RGB images and video, skeleton, depth, infrared, point cloud, event stream, audio, acceleration, and radar signals. Each modality provides unique and complementary information suited to different application scenarios. Consequently, numerous studies have investigated diverse approaches for HAR using these modalities. This paper presents a comprehensive survey of the latest advancements in HAR from 2014 to 2024, focusing on machine learning (ML) and deep learning (DL) approaches categorized by input data modalities. We review both single-modality and multi-modality techniques, highlighting fusion-based and co-learning frameworks. Additionally, we cover advancements in hand-crafted action features, methods for recognizing human-object interactions, and activity detection. Our survey includes a detailed dataset description for each modality and a summary of the latest HAR systems, offering comparative results on benchmark datasets. Finally, we provide insightful observations and propose effective future research directions in HAR.
