Enhancing Inertial Hand based HAR through Joint Representation of Language, Pose and Synthetic IMUs

Vitor Fortes Rey, Lala Shakti Swarup Ray, Xia Qingxin, Kaishun Wu, Paul Lukowicz

TL;DR

This work addresses human activity recognition (HAR) under data scarcity by using abundant video data to pretrain joint representations of text, pose, and IMU signals. It introduces Multi$^3$Net, a multi-modal, multi-task framework that uses SMPL-based IMU simulation to generate high-quality synthetic IMU data from video MoCap, and trains with a combination of multi-modal contrastive learning, Pose2IMU regression, and IMU reconstruction. The pretrained encoders are then fine-tuned on limited real IMU data for downstream HAR. The variant whose encoders are not frozen consistently delivers the strongest gains and outperforms the baselines on OpenPack and MM-Fit in macro F1. The approach reduces dependence on large labeled IMU datasets and improves recognition of fine-grained activities, which suggests practical potential for real-world wearable HAR applications.
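To make the three training terms concrete, here is a minimal sketch of how a contrastive term, a Pose2IMU regression term, and an IMU reconstruction term could be combined into one objective. It is not the authors' implementation: the symmetric InfoNCE form, the choice of modality pairs, the mean squared error, the temperature, and the weights `w_contrast`, `w_pose2imu`, and `w_recon` are all assumptions made for illustration, and every function name is hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def info_nce(anchors, positives, temperature=0.1):
    """Symmetric InfoNCE: row i of `anchors` is the positive of row i of `positives`."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    n = len(a)
    logits = [[dot(a[i], p[j]) / temperature for j in range(n)] for i in range(n)]

    def cross_entropy_diag(rows):
        # Cross-entropy where the correct "class" of row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


def mse(pred, target):
    """Mean squared error over a batch of equally sized vectors."""
    sq = [(x - y) ** 2 for p, t in zip(pred, target) for x, y in zip(p, t)]
    return sum(sq) / len(sq)


def multitask_loss(z_text, z_pose, z_imu, imu_from_pose, imu_recon, imu_true,
                   w_contrast=1.0, w_pose2imu=1.0, w_recon=1.0):
    # Assumption: all three modality pairs are contrasted and averaged.
    contrast = (info_nce(z_text, z_pose)
                + info_nce(z_text, z_imu)
                + info_nce(z_pose, z_imu)) / 3.0
    return (w_contrast * contrast
            + w_pose2imu * mse(imu_from_pose, imu_true)   # Pose2IMU regression
            + w_recon * mse(imu_recon, imu_true))         # IMU reconstruction
```

With matched embeddings the contrastive term is close to zero, and it grows when the pairs in a batch are shuffled, which is the behaviour that pulls the text, pose, and IMU embeddings of the same clip together.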

Abstract

Due to the scarcity of labeled sensor data in HAR, prior research has turned to video data to synthesize Inertial Measurement Unit (IMU) data, capitalizing on its rich activity annotations. However, generating IMU data from videos presents challenges for HAR in real-world settings, attributed to the poor quality of synthetic IMU data and its limited efficacy for subtle, fine-grained motions. In this paper, we propose Multi$^3$Net, a novel multi-modal, multitask, contrastive framework that addresses the issue of limited data. Our pretraining procedure uses videos from online repositories and aims to learn joint representations of text, pose, and IMU simultaneously. By employing video data and contrastive learning, our method seeks to enhance wearable HAR performance, especially in recognizing subtle activities. Our experimental findings validate the effectiveness of our approach in improving HAR performance with IMU data. We demonstrate that models trained with synthetic IMU data generated from videos using our method surpass existing approaches in recognizing fine-grained activities.
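As a rough illustration of what synthesizing IMU data from a pose trajectory involves, the sketch below derives linear acceleration from 3D positions of a tracked point (for example a wrist) with a second-order central difference. This is a generic technique and an assumption on our part; the abstract does not say how the simulation is computed. The function name `simulate_acceleration` is hypothetical, and the sketch ignores gravity, sensor orientation, and noise, all of which a real accelerometer signal includes.

```python
def simulate_acceleration(positions, sample_rate_hz):
    """Approximate linear acceleration (m/s^2) from positions (m) sampled at a fixed rate.

    Uses the central difference a[t] = (p[t+1] - 2 p[t] + p[t-1]) * fs^2,
    so the output has two fewer samples than the input.
    """
    if len(positions) < 3:
        raise ValueError("need at least three position samples")
    scale = float(sample_rate_hz) ** 2
    return [
        tuple((nxt - 2.0 * cur + prev) * scale
              for prev, cur, nxt in zip(positions[i - 1], positions[i], positions[i + 1]))
        for i in range(1, len(positions) - 1)
    ]


# Usage: a point moving along z with constant acceleration 2 m/s^2, sampled at 10 Hz.
trajectory = [(0.0, 0.0, (i / 10.0) ** 2) for i in range(6)]
acceleration = simulate_acceleration(trajectory, sample_rate_hz=10)
```

Because differentiation amplifies jitter in the estimated pose, the quality of the underlying motion capture directly limits the quality of the synthetic signal, which is the problem the abstract highlights for fine-grained motions.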

Paper Structure

This paper contains 15 sections, 7 equations, 4 figures, and 1 table.

Figures (4)

  • Figure 1: Example of ground-truth IMU data and synthetic IMU data generated by Kinect-based (IMUTube and Vi2IMU) and SMPL-based (Multi$^3$Net) methods.
  • Figure 2: Overview of the Multi$^3$Net architecture, showing three steps: (1) sensor simulation, (2) multitask pretraining, and (3) downstream training and evaluation.
  • Figure 3: t-SNE latent representations of the proposed approach for the OpenPack test set (U0201). Each point is a sample in the dataset and each color represents a unique class.
  • Figure 4: Macro F1 score for different amounts of IMU data used for the downstream task (top: left wrist; bottom: both wrists) using the baseline, DCL (only real data; real + virtual data, IMUTube), and pretrained Multi$^3$Net.
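For readers unfamiliar with the metric reported in Figure 4, macro F1 is the unweighted mean of the per-class F1 scores, so rare activities count as much as frequent ones. The sketch below is a plain-Python illustration of that definition; the activity names in the usage example are made up and do not come from OpenPack or MM-Fit.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Usage: a classifier that always predicts the majority class.
y_true = ["walk"] * 8 + ["pack"] * 2
y_pred = ["walk"] * 10
score = macro_f1(y_true, y_pred)
```

In this example the accuracy is 0.8, but the macro F1 is only about 0.44 because the minority class is never recognized, which is why the metric suits imbalanced, fine-grained activity sets.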