UniMTS: Unified Pre-training for Motion Time Series

Xiyuan Zhang, Diyan Teng, Ranak Roy Chowdhury, Shuheng Li, Dezhi Hong, Rajesh K. Gupta, Jingbo Shang

TL;DR

This paper introduces UniMTS, the first unified pre-training procedure for motion time series that generalizes across diverse device latent factors and activities, and shows exceptional generalizability across 18 motion time series classification benchmark datasets.

Abstract

Motion time series collected from mobile and wearable devices such as smartphones and smartwatches offer significant insights into human behavioral patterns, with wide applications in healthcare, automation, IoT, and AR/XR due to their low-power, always-on nature. However, given security and privacy concerns, building large-scale motion time series datasets remains difficult, preventing the development of pre-trained models for human activity analysis. Typically, existing models are trained and tested on the same dataset, leading to poor generalizability across variations in device location, device mounting orientation and human activity type. In this paper, we introduce UniMTS, the first unified pre-training procedure for motion time series that generalizes across diverse device latent factors and activities. Specifically, we employ a contrastive learning framework that aligns motion time series with text descriptions enriched by large language models. This helps the model learn the semantics of time series to generalize across activities. Given the absence of large-scale motion time series data, we derive and synthesize time series from existing motion skeleton data with all-joint coverage. Spatio-temporal graph networks are utilized to capture the relationships across joints for generalization across different device locations. We further design rotation-invariant augmentation to make the model agnostic to changes in device mounting orientations. Our model shows exceptional generalizability across 18 motion time series classification benchmark datasets, outperforming the best baselines by 340% in the zero-shot setting, 16.3% in the few-shot setting, and 9.2% in the full-shot setting.
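The data synthesis and augmentation steps mentioned in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes skeleton data arrives as an array of joint positions with shape `(T, J, 3)`, approximates linear acceleration with a second-order finite difference, and applies one random rotation per joint that is held fixed over the window. The function names, the array layout, and the per-joint rotation choice are all illustrative assumptions, and the sketch leaves out gravity, angular velocity, and sensor-frame effects.

```python
import numpy as np


def synthesize_acceleration(positions, dt):
    """Approximate per-joint linear acceleration from joint positions.

    positions: array of shape (T, J, 3), assumed layout (time, joint, xyz).
    Returns an array of shape (T - 2, J, 3) from a central second difference.
    """
    return (positions[2:] - 2.0 * positions[1:-1] + positions[:-2]) / dt**2


def random_rotation(rng):
    """Draw a random 3x3 rotation matrix (orthogonal, determinant +1)."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    q = q * np.sign(np.diag(r))  # fix column signs so the draw is unbiased
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1.0  # flip one axis to turn a reflection into a rotation
    return q


def rotate_per_joint(series, rng):
    """Rotate each joint's tri-axial signal by its own random rotation.

    series: array of shape (T, J, 3). The rotation for a joint is the same at
    every time step, mimicking a device mounted at an unknown fixed orientation.
    """
    rots = np.stack([random_rotation(rng) for _ in range(series.shape[1])])
    return np.einsum("jdc,tjc->tjd", rots, series)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(50) * 0.02
    positions = np.zeros((50, 2, 3))
    positions[:, 0, 0] = 0.5 * 2.0 * t**2  # joint 0 accelerates along x
    acc = synthesize_acceleration(positions, dt=0.02)
    augmented = rotate_per_joint(acc, rng)
    print(acc.shape, augmented.shape)
```

A rotation changes the direction of each acceleration vector but not its magnitude, so a model trained on many such rotated copies is pushed to rely on orientation-independent structure in the signal.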

Paper Structure

This paper contains 24 sections, 12 equations, 9 figures, and 3 tables.

Figures (9)

  • Figure 1: Our framework addresses all three generalization challenges (variation in device location, orientation and activity) where existing methods fall short.
  • Figure 2: UniMTS pre-training framework: The physics engine computes motion time series for each joint based on motion skeleton data and enhances time series through rotation-invariant augmentation. During pre-training, we adopt contrastive learning to align motion time series encoded by graph convolutional neural networks with corresponding text descriptions augmented by an LLM.
  • Figure 3: Inference (left) and fine-tuning (right) phases of UniMTS. We assign real signals to the nearest location in the skeleton graph. During inference, we compute the similarity score between the graph embedding and each label candidate, and predict the label with the highest score. During fine-tuning, we freeze the text encoder and update the weights of the graph encoder and linear layer.
  • Figure 4: Few-shot fine-tuning results. UniMTS consistently outperforms both baselines and our model ablation. We repeat 3 runs and report both mean and standard deviation.
  • Figure 5: t-SNE visualizations show that signal clusters align with their semantic meanings.
  • ...and 4 more figures
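
The captions for Figures 2 and 3 describe aligning time series embeddings with text embeddings during pre-training, then predicting the label whose text embedding scores highest at inference. The sketch below shows one common way to express both steps. It assumes a CLIP-style symmetric contrastive loss over cosine similarities and a temperature of 0.07. Neither detail is stated in this section, and the function names are hypothetical. The encoders themselves (graph network and text model) are left out, so the inputs are plain embedding matrices.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(series_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss for a batch of paired embeddings.

    series_emb, text_emb: arrays of shape (B, D); row i of each is a matched pair.
    The temperature value is an assumed default, not taken from the paper.
    """
    logits = l2_normalize(series_emb) @ l2_normalize(text_emb).T / temperature
    idx = np.arange(logits.shape[0])
    series_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_series = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (series_to_text + text_to_series)


def zero_shot_predict(series_emb, label_emb):
    """Return, for each series embedding, the index of the most similar label."""
    scores = l2_normalize(series_emb) @ l2_normalize(label_emb).T
    return scores.argmax(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    label_emb = rng.standard_normal((5, 16))  # stand-in for encoded label texts
    # Stand-in series embeddings: noisy copies of labels 3, 0 and 4.
    series_emb = label_emb[[3, 0, 4]] + 0.01 * rng.standard_normal((3, 16))
    print(zero_shot_predict(series_emb, label_emb))
```

Because prediction only needs a text embedding for each candidate label, new activity names can be scored without retraining, which is what makes the zero-shot setting possible.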