MoPFormer: Motion-Primitive Transformer for Wearable-Sensor Activity Recognition

Hao Zhang, Zhan Zhuang, Xuehao Wang, Xiaodong Yang, Yu Zhang

TL;DR

MoPFormer tackles interpretability and cross-domain generalization in IMU-based HAR by tokenizing multi-channel IMU streams into discrete motion primitives using a Vector Quantization codebook. A Context-Aware Embedding Module fuses primitive indices, statistical features, and sensor metadata, which a Transformer encoder then processes under a dual-task objective (MAE for self-supervised pretraining and CLS for classification). The approach achieves state-of-the-art results across six HAR benchmarks and demonstrates strong cross-dataset transfer, with learned primitives offering tangible interpretability through similarity, frequency, and transition analyses. This motion-primitive, transformer-based framework advances practical wearable sensing by combining robust performance with intelligible representations that reflect fundamental movement patterns across datasets.

Abstract

Human Activity Recognition (HAR) with wearable sensors is challenged by limited interpretability, which significantly impacts cross-dataset generalization. To address this challenge, we propose Motion-Primitive Transformer (MoPFormer), a novel self-supervised framework that enhances interpretability by tokenizing inertial measurement unit signals into semantically meaningful motion primitives and leverages a Transformer architecture to learn rich temporal representations. MoPFormer comprises two stages. The first stage is to partition multi-channel sensor streams into short segments and quantize them into discrete "motion primitive" codewords, while the second stage enriches those tokenized sequences through a context-aware embedding module and then processes them with a Transformer encoder. The proposed MoPFormer can be pre-trained using a masked motion-modeling objective that reconstructs missing primitives, enabling it to develop robust representations across diverse sensor configurations. Experiments on six HAR benchmarks demonstrate that MoPFormer not only outperforms state-of-the-art methods but also successfully generalizes across multiple datasets. More importantly, the learned motion primitives significantly enhance both interpretability and cross-dataset performance by capturing fundamental movement patterns that remain consistent across similar activities, regardless of dataset origin.
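
The first stage described above (segment, normalize, quantize) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes each channel is tokenized independently, that the statistical features are the per-segment mean and standard deviation removed by instance normalization, and that quantization is a nearest-codeword lookup under squared Euclidean distance. The function name `tokenize_window`, the codebook size, and the segment length are hypothetical choices for illustration.

```python
import numpy as np

def tokenize_window(window, codebook, segment_len, eps=1e-6):
    """Sketch of motion-primitive tokenization (assumed design, not the paper's code).

    window:   (T, C) array of raw IMU samples, T time steps and C channels.
    codebook: (K, segment_len) array of codewords (random here, learned in practice).
    Returns VQ indices of shape (n_seg, C) and statistics of shape (n_seg, C, 2).
    """
    T, C = window.shape
    n_seg = T // segment_len
    # Drop trailing samples that do not fill a segment, then split each channel.
    segs = window[: n_seg * segment_len].reshape(n_seg, segment_len, C)
    segs = segs.transpose(0, 2, 1)  # (n_seg, C, segment_len)

    # Instance normalization per segment and channel; the removed mean and std
    # are kept as the (assumed) statistical features.
    mean = segs.mean(axis=-1, keepdims=True)
    std = segs.std(axis=-1, keepdims=True)
    normed = (segs - mean) / (std + eps)

    # Nearest-codeword lookup under squared Euclidean distance: (n_seg, C, K).
    dists = ((normed[:, :, None, :] - codebook[None, None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=-1)
    stats = np.concatenate([mean, std], axis=-1)
    return indices, stats

# Usage with synthetic data: 100 samples, 6 channels, 32 codewords of length 10.
rng = np.random.default_rng(0)
window = rng.normal(size=(100, 6))
codebook = rng.normal(size=(32, 10))
indices, stats = tokenize_window(window, codebook, segment_len=10)
```

Because normalization removes offset and scale before the lookup, two segments with the same shape but different amplitude map to the same index in this sketch, while the amplitude information survives in `stats`.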

Paper Structure

This paper contains 43 sections, 13 equations, 13 figures, 7 tables.

Figures (13)

  • Figure 1: Architecture of the proposed MoPFormer model, showcasing the flow from raw data windows through tokenization, embedding, Transformer encoding, to task-specific heads.
  • Figure 2: Detailed illustration of key modules in our motion-centric framework. (a) Motion Primitive Module: raw data windows from multiple sensor channels are partitioned into motion primitives and processed through instance normalization to generate Vector Quantization (VQ) indices and statistical features. (b) Context-Aware Embedding Module: special tokens are inserted alongside masked motion embedding tokens ($M$) to form the complete input representation, which combines motion primitive embeddings, positional encodings, and statistical feature embeddings. (c) Task Heads: The diagram shows transformed features $X^*$ representing the corresponding position vectors after Transformer Encoder processing. The MAE head utilizes $M^*$ positions for masked token prediction during pretraining, while the trainable CLS head operates exclusively on the transformed [CLS] token representation for downstream task fine-tuning.
  • Figure 3: Motion primitive segmentation and similarity analysis. (a) 5-second raw accelerometer trace from USC-HAD dataset, segmented into ten 0.5-second motion primitives. (b) Cosine-similarity matrix of motion primitive embeddings from accelerometer data. (c) Corresponding 5-second gyroscope trace with identical segmentation. (d) Cosine-similarity matrix of embeddings for gyroscope-based motion primitives. The matrices reveal pattern correlations between different motion primitives after Motion Primitive Embedding processing.
  • Figure 4: Frequency and activity composition of the 32 most common motion primitives from PAMAP2. The stacked bars (left axis) show the proportion of each activity label for every VQ index, while the black line (right axis) plots the absolute occurrence count of motion primitives.
  • Figure 5: Motion primitive usage distribution for ascending stairs activity.
  • ...and 8 more figures
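
The analyses behind Figures 3 and 4 can be approximated with two short helpers, shown here as a sketch rather than the authors' code. The function names, array shapes, and the toy inputs are assumptions for illustration. The first helper computes a cosine-similarity matrix between primitive embeddings, and the second counts how often each VQ index occurs and how its occurrences split across activity labels.

```python
import numpy as np

def cosine_similarity_matrix(emb, eps=1e-12):
    """emb: (N, D) primitive embeddings. Returns an (N, N) cosine-similarity matrix."""
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.maximum(norms, eps)  # guard against zero-length vectors
    return unit @ unit.T

def primitive_activity_composition(indices, labels, num_codes, num_activities):
    """indices, labels: 1-D integer arrays with one VQ index and one activity
    label per primitive occurrence. Returns per-index occurrence counts and
    the per-index proportion of each activity label."""
    counts = np.zeros((num_codes, num_activities), dtype=np.int64)
    np.add.at(counts, (indices, labels), 1)  # unbuffered add handles repeated pairs
    totals = counts.sum(axis=1)
    proportions = counts / np.maximum(totals, 1)[:, None]  # unused indices stay at 0
    return totals, proportions

# Toy inputs (hypothetical): three 2-D embeddings and four labeled occurrences.
emb = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
sim = cosine_similarity_matrix(emb)

vq_indices = np.array([0, 0, 1, 0])
activity_labels = np.array([0, 1, 1, 0])
totals, proportions = primitive_activity_composition(
    vq_indices, activity_labels, num_codes=2, num_activities=2
)
```

Here `totals` corresponds to the occurrence line and each row of `proportions` to one stacked bar in a Figure 4 style plot. Sorting by `totals` and keeping the top 32 rows would select the most common primitives.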