UniMotion: Self-Supervised Learning for Cross-Domain IMU Motion Recognition

Prerna Khanna, Tanmay Srivastava, Shubham Jain, Aruna Balasubramanian

Abstract

IMU-based gesture interfaces are being increasingly adopted as efficient, accessible, and intuitive alternatives to traditional input methods, such as touchscreens and voice. However, current gesture recognition algorithms are tailored to specific devices (e.g., smartwatches vs. earbuds) or user populations (e.g., blind vs. sighted users), limiting their generalizability. In this paper, we design UniMotion, a generalized IMU-based gesture recognition framework that works across devices and populations with minimal training samples. To overcome the challenges and high cost of collecting large-scale labeled training data, UniMotion leverages readily available unlabeled human activity data. The UniMotion pipeline comprises two stages: (1) pre-training a motion representation model using abundant unlabeled human activity data, and (2) fine-tuning it with a small amount of labeled gesture data. For pre-training, we introduce a token-based strategy and embeddings that learn to identify and focus attention on the key motion signatures in the temporal data. For fine-tuning, we design a text-guided classifier that can reliably differentiate between temporally or semantically similar gestures. We evaluate UniMotion on both hand gestures (captured through a smartwatch) and earbud gestures (captured through earbuds), using data collected from blind and sighted users. Across these diverse devices and user populations, UniMotion achieves an accuracy of 85% across an average of 13 gesture classes, using only 10% of the labeled data for training. UniMotion significantly outperforms state-of-the-art self-supervised learning approaches and specialized gesture recognition models.

Paper Structure (34 sections, 8 equations, 12 figures, 3 tables)

Figures (12)

  • Figure 2: Comparison of random masking as used in LIMU-BERT (yellow shaded regions) on (a) Continuous HAR and (b) Short-duration Gestures. In HAR (a), the model reconstructs the signal by leveraging the surrounding temporal context of repetitive activities. In contrast, for gestures (b), the motion is brief and non-repetitive; random masking fails to capture the fine-grained gesture nuances.
  • Figure 3: Both (a) walking and (b) swipe gestures exhibit three-phase patterns: preparation, energy-rich nucleus, and retraction. The nucleus contains discriminative information despite differences in duration.
  • Figure 4: Stage 1: Token-based pre-training. The model learns from unlabeled activity data by applying focused masking to the nucleus (high-energy region), while nucleus and significant axis encodings guide attention to discriminative motion patterns.
  • Figure 5: Comparison of attention patterns across different motion sequences with and without token-based pre-training. The heatmaps show attention weights between sequence positions, with brighter colors indicating stronger attention. Red rectangles highlight the nucleus regions of each motion activity. Token-based pre-training produces focused vertical attention bands in the nucleus regions; without this guidance, the attention is scattered. On average, token-based pre-training reduced reconstruction MSE by 8.7% for these sequences.
  • Figure 6: Stage 2: Text-guided Contrastive Classifier. Motion embeddings from the pre-trained model are combined with semantic embeddings derived from gesture descriptions through BERT. The classifier uses semantic and contrastive losses to organize the embedding space, enabling accurate gesture classification with minimal labeled examples.
  • ...and 7 more figures
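
The nucleus-focused masking described in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that the nucleus is the fixed-length window with the highest summed squared magnitude, that the significant axis is the axis with the most energy inside that window, and that masking means overwriting samples with a constant. The function names (find_nucleus, significant_axis, mask_nucleus), the window length, and the toy signal are all hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float, float]  # one 3-axis IMU reading (x, y, z)


def find_nucleus(signal: List[Sample], window: int) -> Tuple[int, int]:
    """Return (start, end) of the fixed-length window with the highest energy.

    Assumption: "energy" is the sum of squared axis values per sample.
    """
    if not 0 < window <= len(signal):
        raise ValueError("window must be between 1 and len(signal)")
    energy = [x * x + y * y + z * z for x, y, z in signal]
    current = sum(energy[:window])
    best_start, best = 0, current
    # Slide the window one sample at a time, updating the running sum.
    for start in range(1, len(energy) - window + 1):
        current += energy[start + window - 1] - energy[start - 1]
        if current > best:
            best_start, best = start, current
    return best_start, best_start + window


def significant_axis(signal: List[Sample], start: int, end: int) -> int:
    """Index (0=x, 1=y, 2=z) of the axis with the most energy in [start, end)."""
    totals = [sum(s[axis] ** 2 for s in signal[start:end]) for axis in range(3)]
    return totals.index(max(totals))


def mask_nucleus(
    signal: List[Sample], window: int, mask_value: float = 0.0
) -> Tuple[List[Sample], Tuple[int, int]]:
    """Replace the nucleus samples with a constant and return the masked span."""
    start, end = find_nucleus(signal, window)
    masked = [
        (mask_value, mask_value, mask_value) if start <= i < end else s
        for i, s in enumerate(signal)
    ]
    return masked, (start, end)


# Synthetic toy signal: quiet preparation, a short burst on the y axis, quiet retraction.
toy_signal: List[Sample] = (
    [(0.1, 0.0, 0.0)] * 4
    + [(0.2, 2.0, 0.1), (0.1, 3.0, 0.2), (0.0, 2.5, 0.1)]
    + [(0.1, 0.0, 0.0)] * 3
)
masked_signal, nucleus_span = mask_nucleus(toy_signal, window=3)
dominant_axis = significant_axis(toy_signal, *nucleus_span)
```

On this toy signal the detected span covers the three burst samples and the dominant axis is y. A reconstruction objective would then be computed over the masked span, which forces the model to predict the most informative part of the motion rather than an arbitrary region.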