Past, Present, and Future of Sensor-Based Human Activity Recognition Using Wearables: A Surveying Tutorial on a Still Challenging Task

Harish Haresamudram, Chi Ian Tang, Sungho Suh, Paul Lukowicz, Thomas Ploetz

TL;DR

This survey traces the evolution of sensor-based HAR from hand-crafted features and the Activity Recognition Chain to end-to-end deep learning, identifying core limitations such as data scarcity, information coding, and signal ambivalence. It highlights a current shift toward learning rich representations through self-supervised and contrastive learning, including multi-device and multi-modal approaches, and reviews data generation and augmentation strategies (video-to-IMU, GANs, diffusion, simulations). The paper then argues for a third paradigm: leveraging foundational models (LLMs and time-series foundation models) to inject world knowledge, perform cross-modal alignment, and generate diverse synthetic data to improve generalization in real-world HAR. It provides a hands-on tutorial and accompanying code to guide practitioners in building practical HAR systems that scale beyond traditional benchmarks. Finally, the work outlines actionable directions for integrating foundational models to address long-standing HAR challenges and broaden the applicability of wearables in health, sport, and industrial settings.

Abstract

In the many years since the inception of wearable sensor-based Human Activity Recognition (HAR), a wide variety of methods have been introduced and evaluated for their ability to recognize activities. Substantial gains have been made since the days of hand-crafting heuristics as features, yet progress has seemingly stalled on many popular benchmarks, with performance falling short of what may be considered 'sufficient' -- despite the increase in computational power and scale of sensor data, as well as the rising complexity of the techniques being employed. The HAR community is approaching a new paradigm shift, this time incorporating world knowledge from foundational models. In this paper, we take stock of sensor-based HAR -- surveying it from its beginnings to the current state of the field, and charting its future. This is accompanied by a hands-on tutorial, through which we guide practitioners in developing HAR systems for real-world application scenarios. We provide a compendium, for novices and experts alike, of methods that aim at finally solving the activity recognition problem.

Paper Structure

This paper contains 55 sections, 2 equations, 15 figures, 1 table.

Figures (15)

  • Figure 1: The Activity Recognition Chain as summarized in bulling2014tutorial.
  • Figure 2: An overview of the self-supervised learning pipeline. Reproduced with permission from haresamudram2022assessing.
  • Figure 3: An overview of contrastive learning (CL). Data samples (images or sensor signals depending on the target task) are first passed through an encoder to obtain embeddings in a latent space. These embeddings are pulled closer together or pushed apart depending on whether they form a positive pair or a negative pair with the anchor.
  • Figure 4: Contrastive learning adaptations in human activity recognition. A comparison is drawn among SimCLR tang2020exploring, CPC haresamudram2021contrastive, BYOL haresamudram2022assessing and SimSiam haresamudram2022assessing. A high degree of commonality can be found among these frameworks, especially in the use of augmented anchors as positive samples for SimCLR, BYOL, and SimSiam. CPC differs from the rest by using future samples instead of augmented views. SimCLR and CPC leverage time-misaligned samples as negatives, while BYOL and SimSiam leverage additional mechanisms to remove the requirement of negative samples.
  • Figure 5: An overview of multi-device and multi-modal contrastive learning frameworks in human activity recognition. A comparison is drawn among ColloSSL jain2022collossl, Learning from the Best (LftB) fortes2022learning, COCOA deldari2022cocoa and SimCLR tang2020exploring (as a single-device single-modality reference). Instead of relying on augmentations (as in SimCLR), these multi-device and multi-modal approaches use time-aligned data from different devices (ColloSSL and LftB) or different modalities (COCOA) as positive samples. Using time-misaligned samples as negatives is common to these approaches, with individual frameworks imposing additional constraints on the sampling algorithm.
  • ...and 10 more figures
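
To make the mechanism in the Figure 3 caption concrete, here is a minimal sketch of an InfoNCE-style contrastive loss for a single anchor. It is not taken from the paper or its accompanying code: the function names, the temperature value, and the toy 2-D embeddings are all assumptions chosen for illustration, and a real pipeline would obtain the embeddings from a trained encoder rather than write them by hand.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length so that dot products become cosine
    # similarities. Assumes the vector is non-zero.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor embedding (illustrative sketch).

    The loss is small when the anchor is more similar to its positive
    than to any negative, and large otherwise.
    """
    a = l2_normalize(anchor)
    pos_logit = dot(a, l2_normalize(positive)) / temperature
    neg_logits = [dot(a, l2_normalize(n)) / temperature for n in negatives]
    logits = [pos_logit] + neg_logits
    # Numerically stable log-sum-exp over the positive and all negatives.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    # Negative log-probability of picking the positive among all candidates.
    return log_sum_exp - pos_logit


# Hypothetical 2-D embeddings, hand-written for the example.
anchor = [1.0, 0.0]

# Positive lies near the anchor, negatives lie far away: low loss.
close_loss = info_nce_loss(anchor, [0.9, 0.1], [[-1.0, 0.0], [0.0, -1.0]])

# Positive lies far from the anchor, negatives lie near it: high loss.
far_loss = info_nce_loss(anchor, [0.0, 1.0], [[1.0, 0.1], [0.9, -0.1]])

print(close_loss < far_loss)
```

Minimizing this quantity with respect to the encoder parameters is what pulls positive pairs together and pushes negative pairs apart. In the frameworks compared in Figures 4 and 5, the main difference lies in where the positive and negative samples come from (augmented views, future samples, or time-aligned data from other devices or modalities), not in this basic pull/push principle.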