MuJo: Multimodal Joint Feature Space Learning for Human Activity Recognition

Stefan Gerd Fritsch, Cennet Oguz, Vitor Fortes Rey, Lala Ray, Maximilian Kiefer-Emmanouilidis, Paul Lukowicz

TL;DR

MuJo introduces a multimodal joint feature space for HAR by pre-training on FiMAD, a large YouTube-based dataset with video, pose, simulated IMU, and text. The core idea is to align representations across four modalities via pairwise contrastive learning, enabling data-efficient improvements on real HAR datasets such as MM-Fit, MyoGym, MotionSense, and MHEALTH. Empirical results show gains in both unimodal and multimodal settings, with strong data efficiency and robust performance when training data are scarce, though domain shifts can limit zero-shot transfer. The combination of FiMAD and MuJo demonstrates the value of leveraging synthetic sensor data and cross-modal supervision to enhance HAR across real-world scenarios, while highlighting avenues for future domain expansion and improved NULL-class handling.
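The pairwise alignment mentioned above can be illustrated with a small sketch. The summary only says that the four modalities are aligned via pairwise contrastive learning; the symmetric InfoNCE form, the temperature of 0.07, and all function names below are assumptions made for illustration, not the authors' implementation.

```python
import math
from itertools import combinations


def l2_normalize(vec):
    # Guard against zero vectors so the division is always defined.
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of embeddings.

    Row i of `a` and row i of `b` are assumed to come from the same clip
    (positive pair); all other rows in the batch act as negatives.
    """
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    n = len(a)
    # Cosine-similarity logits, scaled by the (assumed) temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_on_diagonal(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # log-sum-exp with max subtraction for stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(columns))


def pairwise_contrastive_loss(embeddings, temperature=0.07):
    """Average the symmetric InfoNCE loss over every pair of modalities."""
    pairs = list(combinations(sorted(embeddings), 2))
    return sum(info_nce(embeddings[m], embeddings[n], temperature) for m, n in pairs) / len(pairs)


# Toy batch of three clips with hypothetical 3-dimensional embeddings.
batch = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = {name: batch for name in ("video", "pose", "imu", "text")}
# Rotate the text embeddings so they no longer match their clips.
misaligned = dict(aligned, text=batch[1:] + batch[:1])

loss_aligned = pairwise_contrastive_loss(aligned)
loss_misaligned = pairwise_contrastive_loss(misaligned)
```

With four modalities there are six pairs, and the loss is small only when embeddings of the same clip agree across every pair, which is what a joint feature space requires.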

Abstract

Human activity recognition (HAR) is a long-standing problem in artificial intelligence with applications in a broad range of areas, including healthcare, sports and fitness, security, and more. The performance of HAR in real-world settings is strongly dependent on the type and quality of the input signal that can be acquired. Given an unobstructed, high-quality camera view of a scene, computer vision systems, in particular in conjunction with foundation models, can today fairly reliably distinguish complex activities. On the other hand, recognition using modalities such as wearable sensors (which are often more broadly available, e.g., in mobile phones and smartwatches) is a more difficult problem, as the signals often contain less information and labeled training data is more difficult to acquire. To alleviate the need for labeled data, we introduce our comprehensive Fitness Multimodal Activity Dataset (FiMAD) in this work, which can be used with the proposed pre-training method MuJo (Multimodal Joint Feature Space Learning) to enhance HAR performance across various modalities. FiMAD was created using YouTube fitness videos and contains parallel video, language, pose, and simulated IMU sensor data. MuJo utilizes this dataset to learn a joint feature space for these modalities. We show that classifiers pre-trained on FiMAD can increase the performance on real HAR datasets such as MM-Fit, MyoGym, MotionSense, and MHEALTH. For instance, on MM-Fit, we achieve a Macro F1-Score of up to 0.855 when fine-tuning on only 2% of the training data and 0.942 when utilizing the complete training set for classification tasks. We compare our approach with other self-supervised ones and show that, unlike them, ours consistently improves compared to the baseline network performance while also providing better data efficiency.
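The Macro F1-Scores quoted above weight every activity class equally, regardless of how many windows it contains. A minimal sketch of the standard definition follows; the class names and predictions are hypothetical, and the abstract does not specify which tooling the authors used to compute the metric.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class with no hits at all scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical, imbalanced toy example: one rare class is missed entirely.
y_true = ["squat", "squat", "squat", "squat", "lunge", "null"]
y_pred = ["squat", "squat", "squat", "squat", "null", "null"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

In this toy case accuracy is 5/6 while the macro F1 is 5/9, because the missed rare class counts as much as the dominant one. That sensitivity is why the metric suits HAR datasets with uneven class frequencies.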

Paper Structure

This paper contains 20 sections, 4 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Pipeline for constructing FiMAD and training MuJo for multimodal joint feature space learning. The asterisk (*) marks inputs that are pre-computed (frozen) and not optimized during training.
  • Figure 2: Classification performance across various methods and training data fractions, where each boxplot represents the results of 20 runs. The comparison includes the baseline model and models with a pre-trained encoder and projection with trainable weights on accelerometer data for all evaluated datasets.
  • Figure 3: Classification performance across various training data fractions on MM-Fit, where each boxplot represents the results of 20 runs. The baseline model is compared to models with a pre-trained encoder and projection (both with frozen and trainable weights) on all input modalities (sensor, pose, video, and multimodal).
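The evaluation protocol behind Figures 2 and 3 (several training data fractions, 20 runs each) can be sketched as follows. The captions do not say how the subsets were drawn, so the class-stratified sampling, the seeding scheme, and the helper name `sample_fraction` are assumptions for illustration only.

```python
import random
from collections import defaultdict


def sample_fraction(labels, fraction, seed):
    """Return sorted indices of a class-stratified subset of the training data."""
    rng = random.Random(seed)  # one seed per run keeps each run reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        # Keep at least one example per class, even at very small fractions.
        k = max(1, round(fraction * len(indices)))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Hypothetical label list: 100 windows of one activity, 50 of another.
labels = ["squat"] * 100 + ["lunge"] * 50

# One subset per run; each would be used to fine-tune and score a model,
# and the 20 resulting scores would form one box in a boxplot.
subsets = [sample_fraction(labels, 0.02, seed) for seed in range(20)]
```

Drawing a fresh subset per run means the spread of each boxplot reflects both the choice of training examples and the randomness of training itself.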