A Survey on Multimodal Wearable Sensor-based Human Action Recognition
Jianyuan Ni, Hao Tang, Syed Tousiful Haque, Yan Yan, Anne H. H. Ngu
TL;DR
This survey addresses the multimodal wearable sensor–based HAR problem by cataloging visual and non-visual data modalities, their DL-driven processing, and how inter- and intra-modal fusion can improve recognition. It synthesizes multimodal HAR methods from computer vision and NLP, including data augmentation, SSL, knowledge distillation, synthetic data generation, and data-to-image transformations to leverage CV pipelines. The authors identify dataset scarcity, alignment challenges, and deployment constraints as key bottlenecks, and propose future directions such as foundation models, unified multimodal architectures, and privacy-preserving personalization. The work aims to guide newcomers and researchers toward robust, scalable, and privacy-aware WSHAR solutions with practical impact for aging populations and beyond.
Abstract
The combination of increased life expectancy and falling birth rates is resulting in an aging population. Wearable Sensor-based Human Activity Recognition (WSHAR) has emerged as a promising assistive technology to support the daily lives of older individuals, unlocking vast potential for human-centric applications. However, recent surveys of WSHAR have been limited in scope, focusing either solely on deep learning approaches or on a single sensor modality. In real life, humans interact with the world in a multi-sensory way, in which diverse information sources are intricately processed and interpreted to form a complex, unified sensing system. To give machines similar intelligence, multimodal machine learning, which merges data from various sources, has become a popular research area with recent advancements. In this study, we present a comprehensive survey, from a novel perspective, of how to apply multimodal learning to the WSHAR domain, intended for newcomers and researchers. We begin by presenting recent sensor modalities as well as deep learning approaches in HAR. Subsequently, we explore the techniques used in present multimodal systems for WSHAR. These include inter-multimodal systems, which use sensor modalities from both visual and non-visual systems, and intra-multimodal systems, which take modalities only from non-visual systems. After that, we focus on current multimodal learning approaches that have been applied to solve some of the challenges in WSHAR. Specifically, we make an extra effort to connect the existing multimodal literature from other domains, such as computer vision and natural language processing, with the current WSHAR area. Finally, we identify the corresponding challenges and potential research directions in the current WSHAR area for further improvement.
