In Shift and In Variance: Assessing the Robustness of HAR Deep Learning Models against Variability

Azhar Ali Khaked, Nobuyuki Oishi, Daniel Roggen, Paula Lago

TL;DR

This work addresses the robustness of deep learning-based HAR models to real-world variability by isolating subject, device, position, and orientation factors. It combines the HARVAR and REALDISP datasets with three hybrid DL-HAR architectures, a leave-one-subject-out (LOSO) cross-validation protocol, and Maximum Mean Discrepancy (MMD) to quantify data distribution shifts and predict performance drops. The findings show that orientation variability has the least impact, while device variability yields the largest performance losses. Higher MMD generally correlates with reduced F1 scores, with notable exceptions due to sensor characteristics and sampling rates. The study highlights the need for diverse, variability-aware training data and cautious interpretation of distribution-shift metrics, and it offers practical guidance for building HAR systems that remain robust across devices and wearables in real-world healthcare and monitoring applications.
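The LOSO protocol mentioned above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the helper `loso_splits` and the subject labels are hypothetical, and the only assumption is that each data window carries the ID of the subject who produced it.

```python
def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    for held_out in sorted(set(subject_ids)):
        # Train on every window that does not belong to the held-out subject.
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        # Test only on the held-out subject's windows.
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train_idx, test_idx


# Hypothetical subject label for each window of sensor data.
window_subjects = ["s1", "s1", "s2", "s3", "s3", "s3"]
folds = list(loso_splits(window_subjects))

for subject, train_idx, test_idx in folds:
    print(subject, "train:", train_idx, "test:", test_idx)
```

Each subject is held out exactly once, so the test data in every fold comes from a person the model has never seen during training.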

Abstract

Human Activity Recognition (HAR) using wearable inertial measurement unit (IMU) sensors can revolutionize healthcare by enabling continual health monitoring, disease prediction, and routine recognition. Despite the high accuracy of Deep Learning (DL) HAR models, their robustness to real-world variabilities remains untested, as they have primarily been trained and tested on limited lab-confined data. In this study, we isolate subject, device, position, and orientation variability to determine their effect on DL HAR models and assess the robustness of these models in real-world conditions. We evaluated the DL HAR models using the HARVAR and REALDISP datasets, providing a comprehensive discussion of the impact of variability on data distribution shifts and changes in model performance. Our experiments measured shifts in data distribution using Maximum Mean Discrepancy (MMD) and observed DL model performance drops due to variability. We conclude that the studied variabilities affect DL HAR models differently, and that there is an inverse relationship between data distribution shifts and model performance. The compounding effect of variability was analyzed, and the implications of variabilities in real-world scenarios were highlighted. MMD proved an effective metric for calculating data distribution shifts and explained the drop in performance due to variabilities in the HARVAR and REALDISP datasets. Combining our understanding of variability with an evaluation of its effects will facilitate the development of more robust DL HAR models and optimal training techniques, allowing future models to be assessed not only on their maximum F1 score but also on their ability to generalize effectively.
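To make the MMD measurement above concrete, here is a minimal sketch of the standard biased estimator of squared MMD with a Gaussian (RBF) kernel. The text does not state which kernel, bandwidth, or feature representation the authors used, so `rbf_kernel`, `mmd_squared`, the `gamma` value, and the synthetic Gaussian samples are all assumptions made for illustration.

```python
import math
import random


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""
    def mean_kernel(a, b):
        # Average kernel value over all pairs drawn from a and b.
        return sum(rbf_kernel(u, v, gamma) for u in a for v in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def make_sample(n, mean, rng):
    """Synthetic 3-D feature vectors standing in for sensor features."""
    return [[rng.gauss(mean, 1.0) for _ in range(3)] for _ in range(n)]


rng = random.Random(0)
source = make_sample(100, 0.0, rng)   # e.g. training-sensor features
same = make_sample(100, 0.0, rng)     # same distribution, new draw
shifted = make_sample(100, 1.5, rng)  # distribution with a shifted mean

mmd_same = mmd_squared(source, same, gamma=0.5)
mmd_shifted = mmd_squared(source, shifted, gamma=0.5)
print(f"no shift: {mmd_same:.4f}, shifted: {mmd_shifted:.4f}")
```

The estimate stays near zero when both samples come from the same distribution and grows as the distributions move apart, which is the property that lets MMD serve as a proxy for train-test shift.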

Paper Structure

This paper contains 25 sections, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Placement of sensors in the HARVAR data collection. The Empatica Embrace Plus and BlueSense sensors are placed in the same coordinate system, and their axes are marked. BR2, marked in red, is rotated 45 degrees about the Z-axis. In this diagram, the person is facing towards the reader.
  • Figure 2: The experiment settings using the HARVAR dataset to evaluate the effect of device, position, and orientation variability. Sensor 1 and Sensor 2 are used together as a train-test pair to highlight variability. In these diagrams, the person is facing towards the reader.
  • Figure 3: The process of evaluating the effect of variability using the HARVAR dataset.
  • Figure 4: Performance changes due to Orientation Variability. We show the average F1 score and average MMD values for each DL HAR model in the two experiments. Light green bars represent the no-variability setting of each experiment, and dark green bars represent the variability setting. Asterisks represent the p-value of a paired t-test (*: p-value < 0.05, **: p-value < 0.01, ***: p-value < 0.001). Only two models in one experiment showed significant performance changes, and even then the F1 score remained above 0.7.
  • Figure 5: Performance changes due to Positional Variability. Bars represent the average F1 score for each DL HAR model, and the lines represent the average MMD values of the settings. Light green bars represent the no-variability setting; dark green bars represent the setting with variability. Asterisks represent the p-value of a paired t-test (*: p-value < 0.05, **: p-value < 0.01, ***: p-value < 0.001). Significant performance changes were found for all models when BlueSense sensors were used but not for Empatica sensors.
  • ...and 5 more figures
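The captions above mark significance with asterisks derived from a paired t-test. The sketch below shows how a paired t statistic is computed and how a p-value maps to the asterisk convention in the captions. The function names and the per-fold F1 values are hypothetical and are not results from the paper. Converting the statistic to a p-value requires the t-distribution CDF (available, for example, through `scipy.stats.ttest_rel`), which is omitted here to keep the example dependency-free.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """t statistic for paired samples a and b (same folds, two settings)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Mean difference divided by its standard error.
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


def significance_stars(p_value):
    """Map a p-value to the asterisk convention used in the figure captions."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""


# Hypothetical per-fold F1 scores, for illustration only.
f1_no_variability = [0.90, 0.80, 0.85, 0.95]
f1_variability = [0.70, 0.65, 0.60, 0.80]

t_stat = paired_t_statistic(f1_no_variability, f1_variability)
print(f"t = {t_stat:.3f} with {len(f1_variability) - 1} degrees of freedom")
print(repr(significance_stars(0.03)))
```

The test is paired because both settings are evaluated on the same cross-validation folds, so each fold contributes one difference.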