A Survey on Multimodal Wearable Sensor-based Human Action Recognition

Jianyuan Ni, Hao Tang, Syed Tousiful Haque, Yan Yan, Anne H. H. Ngu

TL;DR

This survey addresses the multimodal wearable sensor–based HAR problem by cataloging visual and non-visual data modalities, their DL-driven processing, and how inter- and intra-modal fusion can improve recognition. It synthesizes multimodal HAR methods from computer vision and NLP, including data augmentation, SSL, knowledge distillation, synthetic data generation, and data-to-image transformations to leverage CV pipelines. The authors identify dataset scarcity, alignment challenges, and deployment constraints as key bottlenecks, and propose future directions such as foundation models, unified multimodal architectures, and privacy-preserving personalization. The work aims to guide newcomers and researchers toward robust, scalable, and privacy-aware WSHAR solutions with practical impact for aging populations and beyond.

Abstract

The combination of increased life expectancy and falling birth rates is resulting in an aging population. Wearable Sensor-based Human Activity Recognition (WSHAR) has emerged as a promising assistive technology to support the daily lives of older individuals, unlocking vast potential for human-centric applications. However, recent surveys of WSHAR have been limited, focusing either solely on deep learning approaches or on a single sensor modality. In real life, humans interact with the world in a multi-sensory way, where diverse information sources are intricately processed and interpreted to form a complex and unified sensing system. To give machines similar intelligence, multimodal machine learning, which merges data from various sources, has become a popular research area with recent advancements. In this study, we present a comprehensive survey from a novel perspective on how to apply multimodal learning to the WSHAR domain, aimed at newcomers and researchers. We begin by presenting recent sensor modalities as well as deep learning approaches in HAR. Subsequently, we explore the techniques used in present multimodal systems for WSHAR. These include inter-multimodal systems, which use sensor modalities from both visual and non-visual systems, and intra-multimodal systems, which take modalities from non-visual systems only. After that, we focus on current multimodal learning approaches that have been applied to solve some of the challenges existing in WSHAR. Specifically, we make an extra effort to connect the existing multimodal literature from other domains, such as computer vision and natural language processing, with the current WSHAR area. Finally, we identify the corresponding challenges and potential research directions in the current WSHAR area for further improvement.

Paper Structure

This paper contains 34 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: (a) Applications of wearable sensor-based human activity recognition (HAR). (b) Typical wearable devices for the WSHAR problem. (c) Distribution of wearable devices placed on human body areas yadav2021review.
  • Figure 2: Overall structure of our survey. We first present the two mainstream representations available for HAR systems (Visual and Non-Visual) and their current achievements. Next, we introduce multimodal applications to emphasize the emerging need in the wearable HAR domain. We make an extra effort to combine existing multimodal studies from other tasks to form the basis for our discussion of the existing challenges and possible future directions.
  • Figure 3: Current advanced multimodal tasks in other domains. (a) Text-to-image generation task singer2022make. (b) Image-to-text generation task venugopalan2015sequence. (c) Text-to-image generation task qiao2019mirrorgan. (d) Reconstructing music from human brain activity denk2023brain2music. (e) Recent Sora study using diffusion models for video generation tasks peebles2023scalable.
  • Figure 4: Current approaches to the dataset scarcity problem in multimodal WSHAR. (a) Synthesized IMU data from motion capture datasets using the skinned multi-person linear (SMPL) model huang2018deep. (b) Virtual IMU data generation pipeline from the LLM domain leng2024imugpt. (c) Virtual IMU data generation pipeline from the video domain kwon2020imutube. (d) IMU data generation using an advanced diffusion model peebles2023scalable.
  • Figure 5: Current approaches to the limited labeled data problem. (a) MixUp method zhang2017mixup. (b) CutMix approach yun2019cutmix. (c) Self-training based SSL method tang2021selfhar.
  • ...and 1 more figure
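
The MixUp augmentation named in Figure 5(a) can be illustrated with a minimal sketch. It follows the standard MixUp formulation (a convex combination of two samples and of their one-hot labels, with the mixing weight drawn from a Beta(alpha, alpha) distribution) rather than any specific pipeline from the survey. The function name `mixup_pair`, the toy sensor values, and the default `alpha=0.2` are assumptions made for illustration only.

```python
import random


def mixup_pair(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two samples and their one-hot labels with one shared weight.

    x_a, x_b: flattened sensor windows of equal length (hypothetical layout).
    y_a, y_b: one-hot label vectors of equal length.
    """
    # Mixing weight lam lies in [0, 1]; a small alpha keeps it near 0 or 1.
    lam = rng.betavariate(alpha, alpha)
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix, lam


# Toy example: two short accelerometer windows from different classes.
rng = random.Random(0)
walk_window, walk_label = [0.1, 0.4, 0.2, 0.5], [1.0, 0.0]
sit_window, sit_label = [0.0, 0.0, 0.1, 0.0], [0.0, 1.0]
x_mix, y_mix, lam = mixup_pair(walk_window, walk_label,
                               sit_window, sit_label, rng=rng)
print(lam, x_mix, y_mix)
```

Because the same weight is applied to inputs and labels, the mixed label remains a valid probability vector, which is what lets the blended sample be trained on with an ordinary cross-entropy loss.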