Decomposing and Fusing Intra- and Inter-Sensor Spatio-Temporal Signal for Multi-Sensor Wearable Human Activity Recognition
Haoyu Xie, Haoxuan Li, Chunyuan Zheng, Haonan Yuan, Guorui Liao, Jun Liao, Li Liu
TL;DR
The paper addresses wearable human activity recognition (WHAR) by separately modeling intra-sensor and inter-sensor spatio-temporal relationships. It introduces DecomposeWHAR, a two-phase framework. The first phase, Modality-Aware Signal Decomposition, preserves variable-specific temporal features through Modality-Specific Embedding and Local Temporal Extraction. The second phase, Hierarchical Interaction Fusion, combines features through Cross-Channel and Cross-Variable fusion, Mamba-based Global Temporal Aggregation, and self-attention-based Cross-Sensor Interaction. The approach reports state-of-the-art Macro-F1 and accuracy on Opportunity, Realdisp, and Skoda, and keeps computational cost down by using Depth-Wise and Point-Wise convolutions together with a selective SSM-based temporal model. The results support sensor-aware decomposition and dynamic inter-sensor fusion as a basis for robust WHAR, with practical implications for deployment on wearable devices. The framework may also generalize to other multi-sensor time-series classification tasks, and it highlights the importance of directional inter-sensor relationships in recognition systems.
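The efficiency claim rests on replacing a standard convolution with a Depth-Wise plus Point-Wise pair. The minimal sketch below compares weight counts for the two designs. The channel counts and kernel size are hypothetical values chosen for illustration, not figures from the paper, and bias terms are ignored.

```python
def conv1d_params(c_in, c_out, kernel_size):
    """Weights in a standard 1-D convolution (bias ignored)."""
    return c_in * c_out * kernel_size


def depthwise_separable_params(c_in, c_out, kernel_size):
    """Weights in a depth-wise conv (one kernel per input channel)
    followed by a point-wise (kernel size 1) conv that mixes channels."""
    depthwise = c_in * kernel_size
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical sizes, assumed only for this illustration.
C_IN, C_OUT, K = 64, 64, 9

standard = conv1d_params(C_IN, C_OUT, K)
separable = depthwise_separable_params(C_IN, C_OUT, K)
print("standard conv weights:      ", standard)
print("depth-wise + point-wise:    ", separable)
print("reduction factor (approx.): ", round(standard / separable, 2))
```

The saving grows with the kernel size, because the depth-wise stage scales with `c_in * kernel_size` rather than `c_in * c_out * kernel_size`.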
Abstract
Wearable Human Activity Recognition (WHAR) is a prominent research area within ubiquitous computing. Multi-sensor synchronous measurement has proven to be more effective for WHAR than using a single sensor. However, existing WHAR methods apply shared convolutional kernels to every sensor variable, extracting temporal features indiscriminately and failing to capture the spatio-temporal relationships among intra-sensor and inter-sensor variables. We propose the DecomposeWHAR model, consisting of a decomposition phase and a fusion phase, to better model the relationships between modality variables. The decomposition phase creates a high-dimensional representation of each intra-sensor variable through an improved Depth Separable Convolution, capturing local temporal features while preserving each variable's unique characteristics. The fusion phase begins by capturing relationships between intra-sensor variables and fusing their features at both the channel and variable levels. Long-range temporal dependencies are then modeled using a State Space Model (SSM), and cross-sensor interactions are subsequently captured dynamically through a self-attention mechanism, highlighting inter-sensor spatial correlations. Our model demonstrates superior performance on three widely used WHAR datasets, significantly outperforming state-of-the-art models while maintaining acceptable computational efficiency.
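Two ideas from the abstract can be sketched in a few lines: giving each variable its own temporal kernel instead of a shared one, and letting sensors attend to each other. The code below is an illustrative sketch in plain Python, not the authors' implementation. The signals, kernels, and the use of identity query/key/value projections are assumptions made to keep the example small, and the SSM stage is omitted.

```python
import math


def depthwise_conv1d(signals, kernels):
    """Slide kernel i over variable i only ('valid' padding, stride 1).

    As in most deep learning libraries, this is cross-correlation:
    the kernel is not flipped.
    """
    out = []
    for x, k in zip(signals, kernels):
        n = len(x) - len(k) + 1
        out.append([sum(x[t + j] * k[j] for j in range(len(k))) for t in range(n)])
    return out


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention over a list of token vectors.

    Assumption: identity Q/K/V projections (a trained model would learn them).
    Returns the attention weights and the mixed token vectors.
    """
    d = len(tokens[0])
    scores = [
        [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        for q in tokens
    ]
    weights = [softmax(r) for r in scores]
    mixed = [
        [sum(w * v[i] for w, v in zip(ws, tokens)) for i in range(d)]
        for ws in weights
    ]
    return weights, mixed


# Hypothetical toy input: one variable from each of two sensors, five time steps.
signals = [[0.0, 1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0, 0.0]]
# One kernel per variable: a difference filter and a smoothing filter.
kernels = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.0]]

features = depthwise_conv1d(signals, kernels)
weights, mixed = self_attention(features)  # each feature row acts as a sensor token
print(features)
print(weights)
```

Each row of `weights` sums to 1 and describes how strongly one sensor token draws on every other token, which is the sense in which the cross-sensor interaction is computed dynamically from the input rather than fixed in advance.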
