Self-supervised Learning Method Using Transformer for Multi-dimensional Sensor Data Processing
Haruki Kai, Tsuyoshi Okita
TL;DR
This work addresses human activity recognition (HAR) from multi-dimensional sensor data by adapting NLP-style Transformer models into an n-dimensional numerical processing framework. It introduces a linear-projection embedding stage, a binning-based self-supervised pretraining signal, and parallel per-dimension output heads, and it compares three pretraining tasks: masked language modeling (MLM), Reconstruction, and Next Token Prediction. Across five HAR datasets, the approach improves accuracy over a vanilla Transformer by 10–15% and often surpasses ResNet/RF baselines, at the cost of higher training time and memory use. The study highlights practical trade-offs for edge deployment and suggests future work on tailoring MLM difficulty and linking pretraining dynamics to dataset characteristics. Of the three pretraining tasks, MLM offers the most robust downstream benefits.
Abstract
We developed a deep learning algorithm for human activity recognition using sensor signals as input. In this study, we built a pretrained language model based on the Transformer architecture, which is widely used in natural language processing. By leveraging this pretrained model, we aimed to improve performance on the downstream task of human activity recognition. While this task can be addressed using a vanilla Transformer, we propose an enhanced n-dimensional numerical processing Transformer that incorporates three key features: embedding n-dimensional numerical data through a linear layer, binning-based pre-processing, and a linear transformation in the output layer. We evaluated the effectiveness of our proposed model on five datasets. Compared to the vanilla Transformer, our model improved accuracy by 10% to 15%.
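To make the binning idea concrete, the following is a minimal sketch of how continuous sensor readings might be turned into per-dimension class targets for a masked-prediction pretraining task. The abstract does not specify the binning scheme or the masking procedure, so the equal-width bins, the value range, the zero-fill masking, and all function names (`bin_edges`, `to_bin`, `bin_sequence`, `mask_timesteps`) are assumptions made for illustration only, not the authors' implementation.

```python
import random


def bin_edges(lo, hi, n_bins):
    """Interior edges splitting [lo, hi] into n_bins equal-width bins (assumed scheme)."""
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]


def to_bin(value, edges):
    """Map a real value to a bin index in [0, len(edges)]; out-of-range values are clipped."""
    idx = 0
    for edge in edges:
        if value >= edge:
            idx += 1
    return idx


def bin_sequence(seq, lo, hi, n_bins):
    """Bin every dimension of every time step independently.

    Each dimension gets its own class label, which is what a set of
    parallel per-dimension output heads would be trained to predict.
    """
    edges = bin_edges(lo, hi, n_bins)
    return [[to_bin(v, edges) for v in step] for step in seq]


def mask_timesteps(seq, mask_ratio, rng):
    """Replace randomly chosen time steps with zeros (assumed mask value).

    Returns a masked copy of the sequence and the sorted masked positions.
    """
    n_mask = max(1, round(len(seq) * mask_ratio))
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    masked = [list(step) for step in seq]
    for t in positions:
        masked[t] = [0.0] * len(seq[t])
    return masked, positions


# Hypothetical 3-axis accelerometer window with 4 time steps, values in [-1, 1].
window = [
    [-1.0, 0.0, 0.9],
    [0.2, -0.6, 0.4],
    [0.5, 0.5, -0.5],
    [1.0, -1.0, 0.0],
]

targets = bin_sequence(window, lo=-1.0, hi=1.0, n_bins=4)
masked_window, masked_positions = mask_timesteps(window, mask_ratio=0.25, rng=random.Random(0))

print(targets)           # [[0, 2, 3], [2, 0, 2], [3, 3, 1], [3, 0, 2]]
print(masked_positions)  # indices of the time steps the model would have to predict
```

In such a setup, the model would receive `masked_window` (embedded through a linear layer) and be trained to predict the entries of `targets` at `masked_positions`, one classification per sensor dimension. Treating the target as a bin index rather than a real value turns pretraining into classification, which mirrors token prediction in NLP.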
