Transformer-Based Approaches for Sensor-Based Human Activity Recognition: Opportunities and Challenges

Clayton Souza Leite, Henry Mauranen, Aziza Zhanabatyrova, Yu Xiao

TL;DR

Compared with non-transformer alternatives for sensor-based HAR, transformer-based solutions pose higher computational demands, consistently yield inferior performance, and experience significant performance degradation when quantized to accommodate resource-constrained devices.

Abstract

Transformers have excelled in natural language processing and computer vision, paving their way into sensor-based Human Activity Recognition (HAR). Previous studies show that transformers outperform their counterparts only when they harness abundant data or employ compute-intensive optimization algorithms. However, neither of these scenarios is viable in sensor-based HAR, due to the scarcity of data in this field and the frequent need to perform training and inference on resource-constrained devices. Our extensive investigation into various implementations of transformer-based versus non-transformer-based HAR using wearable sensors, encompassing more than 500 experiments, corroborates these concerns. We observe that transformer-based solutions pose higher computational demands, consistently yield inferior performance, and experience significant performance degradation when quantized to accommodate resource-constrained devices. Additionally, transformers demonstrate lower robustness to adversarial attacks, posing a potential threat to user trust in HAR.

Paper Structure

This paper contains 13 sections, 4 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Timeline of the studied neural network architectures. InnoHAR, TEHAR, HART, ResBiLSTM, TTN, and TASKED are in the field of HAR. With the exception of TTN, which improves on TEHAR, these architectures have been developed in a scattered manner, lacking clear interrelation and comparison with one another. With the exception of the original Transformer, we implement all nine depicted neural network architectures.
  • Figure 2: Loss landscapes of the trained architectures for PAMAP2. The x and y axes vary from -3 to 3. The z-axis ranges from 0 to 15.
  • Figure 3: Trace of the Hessian and maximum eigenvalue across all datasets. The y-axis uses a logarithmic scale. Lower is better.
  • Figure 4: Average performance against inference time (in seconds) across all datasets, where inference time measures the duration required to infer all examples from the test set. TASKED and TASKED-SAM achieve average performances of 0.780 and 0.802, respectively. Their inference time is 35.76 s.
  • Figure 5: Average performance against training time (in seconds) across all datasets, where training time represents the duration for an entire training pass over the training set.
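
The sharpness measures named in Figure 3, the trace of the Hessian and its maximum eigenvalue, are usually estimated rather than computed exactly, because the full Hessian of a neural network is too large to form. This page does not say how the authors obtained them, so the following is only a minimal sketch of one common approach: Hutchinson's randomized trace estimator and power iteration, both built on Hessian-vector products. The function names, the finite-difference Hessian-vector product, and the two-parameter toy loss are assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def hvp(grad_fn, w, v, eps=1e-4):
    """Hessian-vector product H v, approximated by a central finite
    difference of the gradient: (g(w + eps*v) - g(w - eps*v)) / (2*eps)."""
    w_plus = [wi + eps * vi for wi, vi in zip(w, v)]
    w_minus = [wi - eps * vi for wi, vi in zip(w, v)]
    g_plus, g_minus = grad_fn(w_plus), grad_fn(w_minus)
    return [(a - b) / (2 * eps) for a, b in zip(g_plus, g_minus)]


def hutchinson_trace(grad_fn, w, n_samples=2000, seed=0):
    """Estimate trace(H) as the average of v^T H v over random
    Rademacher vectors v (entries +1 or -1 with equal probability)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        v = [rng.choice((-1.0, 1.0)) for _ in w]
        hv = hvp(grad_fn, w, v)
        total += sum(a * b for a, b in zip(v, hv))
    return total / n_samples


def max_eigenvalue(grad_fn, w, n_iters=100, seed=0):
    """Power iteration: returns the Hessian eigenvalue of largest
    absolute value, read off as a Rayleigh quotient v^T H v."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in w]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    eig = 0.0
    for _ in range(n_iters):
        hv = hvp(grad_fn, w, v)
        eig = sum(a * b for a, b in zip(v, hv))
        norm = math.sqrt(sum(h * h for h in hv))
        if norm == 0.0:
            break
        v = [h / norm for h in hv]
    return eig


# Hypothetical toy loss L(w) = 0.5 * w^T A w with A = [[2, 1], [1, 3]],
# so the Hessian is A everywhere and the gradient is A w.
def toy_grad(w):
    return [2.0 * w[0] + 1.0 * w[1], 1.0 * w[0] + 3.0 * w[1]]


if __name__ == "__main__":
    w0 = [0.5, -1.0]
    print("estimated trace:", hutchinson_trace(toy_grad, w0))
    print("estimated max eigenvalue:", max_eigenvalue(toy_grad, w0))
```

For the toy loss the exact values are known (trace 5, largest eigenvalue (5 + √5)/2), which makes the estimators easy to sanity-check. With a real network, `grad_fn` would be replaced by the gradient of the training loss with respect to the flattened parameters, and an automatic-differentiation Hessian-vector product would normally be preferred over finite differences.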