PRIMUS: Pretraining IMU Encoders with Multimodal Self-Supervision

Arnav M. Das, Chi Ian Tang, Fahim Kawsar, Mohammad Malekzadeh

TL;DR

PRIMUS addresses the challenge of learning transferable IMU representations under label scarcity by pretraining an IMU encoder with a multi-objective loss that combines self-supervision, multimodal supervision, and nearest-neighbor supervision. It aligns IMU features with video and text through $\mathcal{L}_{MM}$, enforces augmentation invariance via $\mathcal{L}_{SS}$, and exploits cross-instance signals with $\mathcal{L}_{NN}$. With these terms combined, PRIMUS improves few-shot and out-of-domain activity recognition, outperforming prior IMU pretraining methods by up to about 15 percentage points. The approach is data-efficient, robust across domains, and practical for mobile wearables, and its code is open source. These results suggest that integrating diverse supervisory signals during pretraining yields transferable IMU encoders suitable for real-world health and activity monitoring applications.

Abstract

Sensing human motions through Inertial Measurement Units (IMUs) embedded in personal devices has enabled significant applications in health and wellness. Labeled IMU data is scarce; however, unlabeled or weakly labeled IMU data can be used to model human motions. For video or text modalities, the "pretrain and adapt" approach uses large volumes of unlabeled or weakly labeled data to build a strong feature extractor, followed by adaptation to specific tasks using limited labeled data. However, pretraining methods are poorly understood for IMU data, and pipelines are rarely evaluated on out-of-domain tasks. We propose PRIMUS: a method for PRetraining IMU encoderS that uses a novel pretraining objective, empirically validated by downstream performance on both in-domain and out-of-domain datasets. The PRIMUS objective enhances downstream performance by combining self-supervision, multimodal supervision, and nearest-neighbor supervision. With fewer than 500 labeled samples per class, PRIMUS improves test accuracy by up to 15% compared to state-of-the-art baselines. To benefit the broader community, we have open-sourced our code at github.com/nokia-bell-labs/pretrained-imu-encoders.
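The combination of self-supervised, multimodal, and nearest-neighbor supervision described in the abstract can be illustrated with a minimal sketch. It rests on assumptions that this summary does not confirm: each term is written as an InfoNCE-style contrastive loss, the terms are combined as a weighted sum, and the names `info_nce`, `combined_objective`, and the weights `w_ss`, `w_mm`, `w_nn` are hypothetical. The embeddings are made-up toy values, not data from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style loss: match anchors[i] to positives[i] against the batch.

    This is an assumed form for each loss term, not the paper's definition.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, a_i in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(a_i, p_j)) / temperature for p_j in p]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_sum_exp - logits[i]
    return total / len(a)


def combined_objective(l_ss, l_mm, l_nn, w_ss=1.0, w_mm=1.0, w_nn=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return w_ss * l_ss + w_mm * l_mm + w_nn * l_nn


# Toy 2-D embeddings for a batch of two segments (made-up values).
imu = [[1.0, 0.0], [0.0, 1.0]]
imu_augmented = [[0.9, 0.1], [0.1, 0.9]]   # second view of the same IMU segments
video = [[0.8, 0.2], [0.2, 0.8]]           # co-occurring video features
neighbor_video = [[0.7, 0.3], [0.3, 0.7]]  # video features of retrieved neighbors

l_ss = info_nce(imu, imu_augmented)  # augmentation invariance
l_mm = info_nce(imu, video)          # alignment with the paired modality
l_nn = info_nce(imu, neighbor_video) # alignment with a neighbor's features
loss = combined_objective(l_ss, l_mm, l_nn)
```

In this sketch, each term shrinks as the paired embeddings become more similar relative to the other items in the batch, so the weights only control how much each supervisory signal contributes to the total.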

Paper Structure

This paper contains 11 sections, 6 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: PRIMUS Overview. We use a multi-objective pretraining loss with three terms: $\mathcal{L}_{SS}$, $\mathcal{L}_{MM}$, and $\mathcal{L}_{NN}$. Self-supervised losses encourage the IMU encoder to be augmentation invariant, while multimodal and nearest-neighbor losses align the IMU data to co-occurring video and/or text data. We use open-source models for the text and video encoders.
  • Figure 2: The architecture of IMU Encoder $\mathcal{I}$. The backbone consists of both 1D-CNN and GRU layers. During pretraining, the IMU encoder has two MLP heads: one for the multimodal loss and the other for the unimodal loss. After pretraining, only the output of the multimodal head is kept for training downstream tasks, as it offers a more general latent representation. The architecture is adopted from IMU2CLIP.
  • Figure 3: Nearest-neighbor supervision. Given a query segment, we retrieve the most similar segment in a fixed-size queue of features, based on video-to-video similarity, and use all modalities of the retrieved segment to derive supervisory signals for the IMU segment.
  • Figure 4: Main Results. We report the few-shot learning performance of pretrained models on various classification datasets. PRIMUS generally outperforms self-supervised methods (SimCLR, MultitaskSSL), prior multimodal methods (IMU2CLIP), and training a randomly initialized model (standard training). The standard error is computed over 5 trials.
  • Figure 5: Ablations. We assess the importance of each term in the PRIMUS objective by pretraining encoders with different losses and evaluating their few-shot learning performance. The standard error is computed over 5 trials.
  • ...and 1 more figure
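The retrieval step in the Figure 3 caption (find the most similar segment in a fixed-size queue by video-to-video similarity, then use that segment's features as supervision) can be sketched as follows. This is only an illustration under stated assumptions: the caption says "similarity" without naming a measure, so cosine similarity is assumed here, and `FeatureQueue`, `push`, and `nearest` are hypothetical names rather than identifiers from the released code.

```python
import math
from collections import deque


def cosine(u, v):
    """Cosine similarity between two non-zero vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


class FeatureQueue:
    """Fixed-size FIFO of per-segment features; the oldest entry is evicted first."""

    def __init__(self, max_size):
        self.items = deque(maxlen=max_size)

    def push(self, video_feat, text_feat):
        """Store the video and text features of one segment."""
        self.items.append({"video": video_feat, "text": text_feat})

    def nearest(self, query_video_feat):
        """Return the stored entry whose video feature best matches the query."""
        if not self.items:
            raise ValueError("queue is empty")
        return max(
            self.items,
            key=lambda item: cosine(query_video_feat, item["video"]),
        )


# Toy usage with made-up 2-D features.
queue = FeatureQueue(max_size=3)
queue.push([1.0, 0.0], [0.9, 0.1])
queue.push([0.0, 1.0], [0.2, 0.8])

# The query is the video feature paired with the current IMU segment.
neighbor = queue.nearest([0.8, 0.2])
# neighbor["video"] and neighbor["text"] would then act as additional
# positives for the IMU embedding in a nearest-neighbor loss term.
```

A bounded `deque` keeps memory constant and discards stale features automatically, which is one simple way to realize the "fixed-size queue" the caption mentions.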