In Shift and In Variance: Assessing the Robustness of HAR Deep Learning Models against Variability
Azhar Ali Khaked, Nobuyuki Oishi, Daniel Roggen, Paula Lago
TL;DR
This work assesses the robustness of deep learning-based HAR models to real-world variability by isolating subject, device, position, and orientation factors. It combines the HARVAR and REALDISP datasets with three hybrid DL-HAR architectures, a leave-one-subject-out (LOSO) cross-validation protocol, and Maximum Mean Discrepancy ($MMD$) to quantify data distribution shifts and predict performance drops. The findings show that orientation variability has the least impact, while device variability yields the largest performance losses. $MMD$ generally correlates with reduced $F1$-scores, with notable exceptions due to sensor characteristics and sampling rates. The study highlights the need for diverse, variability-aware training data and for cautious interpretation of distribution-shift metrics, and it offers practical guidance for building HAR systems that remain robust across devices and wearables in real-world healthcare and monitoring applications.
Abstract
Human Activity Recognition (HAR) using wearable inertial measurement unit (IMU) sensors can revolutionize healthcare by enabling continual health monitoring, disease prediction, and routine recognition. Despite the high accuracy of Deep Learning (DL) HAR models, their robustness to real-world variabilities remains untested, as they have primarily been trained and tested on limited lab-confined data. In this study, we isolate subject, device, position, and orientation variability to determine their effect on DL HAR models and assess the robustness of these models in real-world conditions. We evaluated the DL HAR models using the HARVAR and REALDISP datasets, providing a comprehensive discussion of the impact of variability on data distribution shifts and changes in model performance. Our experiments measured shifts in data distribution using Maximum Mean Discrepancy (MMD) and observed DL model performance drops due to variability. We conclude that the studied variabilities affect DL HAR models differently, and that there is an inverse relationship between data distribution shifts and model performance. The compounding effect of variability was analyzed, and the implications of variabilities in real-world scenarios were highlighted. MMD proved an effective metric for calculating data distribution shifts and explained the drop in performance due to variabilities in the HARVAR and REALDISP datasets. Combining our understanding of variability with an evaluation of its effects will facilitate the development of more robust DL HAR models and optimal training techniques. This will allow future models to be assessed not only on their maximum F1 score but also on their ability to generalize effectively.
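To make the distribution-shift measurement concrete, the following is a minimal sketch of a squared-MMD estimate between two sets of feature vectors, for example windows recorded under two different device or placement conditions. The abstract does not state which kernel, bandwidth, or estimator the authors used, so the Gaussian (RBF) kernel, the `gamma` parameter, the biased estimator, and the function names `rbf_kernel` and `mmd2_biased` are all assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2).

    The kernel choice is an assumption; the source does not specify one.
    """
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def mmd2_biased(x, y, gamma=1.0):
    """Biased estimate of squared MMD between samples x (n, d) and y (m, d)."""
    k_xx = rbf_kernel(x, x, gamma).mean()
    k_yy = rbf_kernel(y, y, gamma).mean()
    k_xy = rbf_kernel(x, y, gamma).mean()
    return float(k_xx + k_yy - 2.0 * k_xy)


if __name__ == "__main__":
    # Synthetic stand-ins for per-window feature vectors (not real HAR data).
    rng = np.random.default_rng(0)
    reference = rng.normal(size=(200, 6))
    same_condition = rng.normal(size=(200, 6))
    shifted_condition = rng.normal(loc=1.0, size=(200, 6))

    gamma = 1.0 / reference.shape[1]  # hypothetical bandwidth choice
    print("same condition:   ", mmd2_biased(reference, same_condition, gamma))
    print("shifted condition:", mmd2_biased(reference, shifted_condition, gamma))
```

A larger value indicates a larger discrepancy between the two samples under the chosen kernel. Because the estimate depends on the bandwidth, values are only comparable across conditions when `gamma` is held fixed.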
