
MESEN: Exploit Multimodal Data to Design Unimodal Human Activity Recognition with Few Labels

Lilin Xu, Chaojie Gu, Rui Tan, Shibo He, Jiming Chen

TL;DR

MESEN addresses the practical challenge of unimodal HAR when deployment data are sparsely labeled but unlabeled multimodal data are available during model design. It introduces a two-stage pipeline with multimodal-aided pre-training that combines cross-modal feature contrastive learning ($L_{CMF}$) and multimodal pseudo-classification aligning ($L_{MPC}$), followed by a layer-aware unimodal fine-tuning stage with $L_{FT}$. The approach yields substantial improvements over state-of-the-art baselines across eight datasets, averaging +30.7% accuracy and +34.5% F1-score over supervised unimodal learning and +25.2% accuracy and +26.4% F1-score over contrastive baselines. This multimodal-to-unimodal transfer enables robust unimodal HAR in real-world scenarios with limited labels and supports deployment on edge devices.

Abstract

Human activity recognition (HAR) will be an essential function of various emerging applications. However, HAR typically encounters challenges related to modality limitations and label scarcity, leading to an application gap between current solutions and real-world requirements. In this work, we propose MESEN, a multimodal-empowered unimodal sensing framework, to utilize unlabeled multimodal data available during the HAR model design phase for unimodal HAR enhancement during the deployment phase. From a study on the impact of supervised multimodal fusion on unimodal feature extraction, MESEN is designed to feature a multi-task mechanism during the multimodal-aided pre-training stage. With the proposed mechanism integrating cross-modal feature contrastive learning and multimodal pseudo-classification aligning, MESEN exploits unlabeled multimodal data to extract effective unimodal features for each modality. Subsequently, MESEN can adapt to downstream unimodal HAR with only a few labeled samples. Extensive experiments on eight public multimodal datasets demonstrate that MESEN achieves significant performance improvements over state-of-the-art baselines in enhancing unimodal HAR by exploiting multimodal data.
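To make the idea of cross-modal feature contrastive learning more concrete, the following is a minimal sketch of a generic symmetric InfoNCE-style loss between two modalities. It is not the paper's exact $L_{CMF}$: the function names, the temperature value, and the use of plain Python lists are all assumptions made for illustration. The only idea taken from the text is that features from different modalities of the same sample should agree, which lets unlabeled multimodal data supervise each unimodal encoder.

```python
import math


def l2_normalize(v):
    """Scale a feature vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_modal_info_nce(feats_a, feats_b, temperature=0.1):
    """Hypothetical symmetric InfoNCE loss between two modalities.

    feats_a[i] and feats_b[i] are assumed to come from the same time window
    (a positive pair); every other pairing in the batch acts as a negative.
    """
    a = [l2_normalize(v) for v in feats_a]
    b = [l2_normalize(v) for v in feats_b]
    n = len(a)

    # sim[i][j] = cosine similarity of a[i] and b[j], scaled by the temperature
    sim = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy(logits, target):
        # Numerically stable -log softmax(logits)[target]
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
        return log_sum - logits[target]

    # Modality A queries modality B, and vice versa
    loss_a2b = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    loss_b2a = sum(
        cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (loss_a2b + loss_b2a)


# Toy features standing in for accelerometer and gyroscope encoder outputs
acc = [[1.0, 0.0], [0.0, 1.0]]
gyro_aligned = [[2.0, 0.0], [0.0, 3.0]]   # same direction as the paired acc feature
gyro_shuffled = [[0.0, 3.0], [2.0, 0.0]]  # pairs deliberately mismatched

print(cross_modal_info_nce(acc, gyro_aligned))
print(cross_modal_info_nce(acc, gyro_shuffled))
```

In this toy setting the aligned pairing yields a much smaller loss than the shuffled one, which is the signal a pre-training stage of this kind would use to pull paired unimodal features together without any labels.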

Paper Structure

This paper contains 32 sections, 10 equations, 21 figures, and 2 tables.

Figures (21)

  • Figure 1: The application scenario of MESEN. Multimodal data are available on the server for HAR model design, while the user at the edge deploys unimodal HAR with few labels.
  • Figure 2: (a) & (b): Prior works (haresamudram2021contrastive, sheng2022facilitating, ouyang2022cosmo) designed for label scarcity operate in either the unimodal mode or the multimodal fusion mode. (c): MESEN operates in a multi-to-unimodal mode to improve unimodal HAR performance by exploiting unlabeled multimodal data.
  • Figure 3: Unimodal and multimodal recognition results on the UCI dataset. Activities from $a1$ to $a3$ are walking-related while the rest are stationary activities.
  • Figure 4: The visualization of extracted gyroscope features under three conditions.
  • Figure 5: The unimodal features extracted by Cosmo are beneficial to subsequent multimodal fusion instead of unimodal recognition.
  • ...and 16 more figures