Human Activity Recognition from Wearable Sensor Data Using Self-Attention
Saif Mahmud, M Tanjid Hasan Tonmoy, Kishor Kumar Bhaumik, A K M Mahbubur Rahman, M Ashraful Amin, Mohammad Shoyaib, Muhammad Asif Hossain Khan, Amin Ahsan Ali
TL;DR
This work tackles Human Activity Recognition (HAR) from multi-sensor time-series data by replacing recurrent architectures with a transformer-inspired self-attention model. It combines sensor modality attention, multi-head self-attention blocks, and a global temporal attention module to produce discriminative window-level representations without recurrence. Attention follows the transformer-style scaled dot-product mechanism, $\mathrm{softmax}(QK^T / \sqrt{d_k})\,V$, complemented by positional encoding to preserve sequence order. On four public HAR datasets (PAMAP2, Opportunity, USC-HAD, Skoda), the approach outperforms recent state-of-the-art models in window-wise evaluation on the benchmark test subjects and in Leave-One-Subject-Out evaluation, while providing interpretable sensor attention maps that indicate how relevant each sensor modality and placement is to each activity.
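As an illustration of the attention operation named above, the following is a minimal NumPy sketch of single-head scaled dot-product attention over one sensor window. It is not the authors' implementation: the sinusoidal form of the positional encoding, the window length `T`, the feature size `d_model`, and the random projection matrices `W_q`, `W_k`, `W_v` are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum along the axis for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V for a single head and a single window.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # shape (T, T)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

def sinusoidal_positional_encoding(T, d_model):
    # Assumption: sinusoidal encoding as in the original transformer;
    # the text above only says "positional encoding".
    positions = np.arange(T)[:, None]
    dims = np.arange(d_model)[None, :]
    angles = positions / np.power(10000.0, (2 * (dims // 2)) / d_model)
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

# Hypothetical window: T time steps, d_model features per step.
rng = np.random.default_rng(0)
T, d_model = 8, 16
window = rng.normal(size=(T, d_model))
x = window + sinusoidal_positional_encoding(T, d_model)

# Hypothetical (random, untrained) projection matrices.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
output, weights = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
```

Each row of `weights` is a distribution over the time steps of the window, which is what allows every step to draw on context from every other step without recurrence.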
Abstract
Human Activity Recognition from body-worn sensor data poses an inherent challenge in capturing the spatial and temporal dependencies of time-series signals. Existing recurrent, convolutional, and hybrid models for activity recognition struggle to capture spatio-temporal context from the feature space of a sensor reading sequence. To address this problem, we propose a self-attention based neural network model that forgoes recurrent architectures and uses different types of attention mechanisms to generate a higher-dimensional feature representation for classification. We performed extensive experiments on four popular publicly available HAR datasets: PAMAP2, Opportunity, Skoda and USC-HAD. Our model achieves significant performance improvement over recent state-of-the-art models on both the benchmark test subjects and in Leave-One-Subject-Out evaluation. We also observe that the sensor attention maps produced by our model are able to capture the importance of the modality and placement of the sensors in predicting the different activity classes.
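For readers unfamiliar with the Leave-One-Subject-Out protocol mentioned above, the sketch below shows how such folds are commonly built: each subject is held out once for testing while all remaining subjects form the training set. The function name `leave_one_subject_out` and the toy `subject_ids` list are hypothetical and do not reflect the authors' evaluation code or the actual dataset splits.

```python
def leave_one_subject_out(subject_ids):
    # subject_ids[i] is the subject who produced window i (hypothetical layout).
    subjects = sorted(set(subject_ids))
    for held_out in subjects:
        # Train on every window not recorded by the held-out subject.
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        # Test only on windows recorded by the held-out subject.
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train_idx, test_idx

# Toy example: five windows recorded by three subjects.
subject_ids = [1, 1, 2, 2, 3]
folds = list(leave_one_subject_out(subject_ids))
```

Because no window from the held-out subject appears in training, the resulting scores reflect generalization to unseen people rather than to unseen windows from known people.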
