MoPFormer: Motion-Primitive Transformer for Wearable-Sensor Activity Recognition
Hao Zhang, Zhan Zhuang, Xuehao Wang, Xiaodong Yang, Yu Zhang
TL;DR
MoPFormer tackles interpretability and cross-domain generalization in human activity recognition (HAR) from inertial measurement unit (IMU) data by tokenizing multi-channel IMU streams into discrete motion primitives using a Vector Quantization codebook. A Context-Aware Embedding Module fuses primitive indices, statistical features, and sensor metadata, which a Transformer encoder then processes under a dual-task objective (MAE for self-supervised pretraining and CLS for classification). The approach achieves state-of-the-art results across six HAR benchmarks and demonstrates strong cross-dataset transfer, with learned primitives offering tangible interpretability through similarity, frequency, and transition analyses. This motion-primitive, transformer-based framework advances practical wearable sensing by combining robust performance with intelligible representations that reflect fundamental movement patterns across datasets.
Abstract
Human Activity Recognition (HAR) with wearable sensors is challenged by limited interpretability, which significantly impacts cross-dataset generalization. To address this challenge, we propose Motion-Primitive Transformer (MoPFormer), a novel self-supervised framework that enhances interpretability by tokenizing inertial measurement unit (IMU) signals into semantically meaningful motion primitives and leverages a Transformer architecture to learn rich temporal representations. MoPFormer comprises two stages. The first stage partitions multi-channel sensor streams into short segments and quantizes them into discrete "motion primitive" codewords, while the second stage enriches the tokenized sequences through a context-aware embedding module and then processes them with a Transformer encoder. MoPFormer can be pre-trained using a masked motion-modeling objective that reconstructs missing primitives, enabling it to develop robust representations across diverse sensor configurations. Experiments on six HAR benchmarks demonstrate that MoPFormer not only outperforms state-of-the-art methods but also successfully generalizes across multiple datasets. More importantly, the learned motion primitives significantly enhance both interpretability and cross-dataset performance by capturing fundamental movement patterns that remain consistent across similar activities, regardless of dataset origin.
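To make the first stage concrete, the following is a minimal sketch of segment-and-quantize tokenization, not the authors' implementation. It assumes non-overlapping segments, a single codebook shared across channels, and nearest-neighbor assignment under squared Euclidean distance. The function names (`segment`, `quantize`), the segment length, and the toy codebook are hypothetical choices made for illustration only.

```python
import numpy as np


def segment(signal: np.ndarray, seg_len: int) -> np.ndarray:
    """Split a (T, C) signal into non-overlapping segments of shape (N, seg_len, C).

    Trailing samples that do not fill a whole segment are dropped
    (an assumption of this sketch).
    """
    n = signal.shape[0] // seg_len
    return signal[: n * seg_len].reshape(n, seg_len, signal.shape[1])


def quantize(segments: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Assign each per-channel segment to its nearest codeword.

    segments: (N, L, C) array; codebook: (K, L) array of K codewords.
    Returns an (N, C) integer array of codeword indices ("motion primitives").
    """
    x = segments.transpose(0, 2, 1)  # (N, C, L): one length-L vector per channel
    # Squared Euclidean distance from every vector to every codeword: (N, C, K)
    dist = ((x[:, :, None, :] - codebook[None, None, :, :]) ** 2).sum(axis=-1)
    return dist.argmin(axis=-1)


# Toy example: two channels, segment length 4, two hypothetical codewords.
toy_codebook = np.stack([np.zeros(4), np.ones(4)])            # (K=2, L=4)
toy_signal = np.column_stack([np.full(9, 0.1), np.full(9, 0.9)])  # (T=9, C=2)
toy_tokens = quantize(segment(toy_signal, seg_len=4), toy_codebook)
```

In this toy setup each channel yields two tokens, and a trained codebook would replace `toy_codebook`. The resulting index sequence is what the second stage would embed and pass to the Transformer encoder.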
