MESEN: Exploit Multimodal Data to Design Unimodal Human Activity Recognition with Few Labels
Lilin Xu, Chaojie Gu, Rui Tan, Shibo He, Jiming Chen
TL;DR
MESEN addresses the practical challenge of unimodal HAR when deployment data are sparsely labeled but unlabeled multimodal data are available during model design. It introduces a two-stage pipeline: multimodal-aided pre-training, which combines cross-modal feature contrastive learning ($L_{CMF}$) and multimodal pseudo-classification aligning ($L_{MPC}$), followed by layer-aware unimodal fine-tuning with the loss $L_{FT}$. The approach yields substantial improvements over state-of-the-art baselines across eight datasets, averaging +30.7% accuracy and +34.5% F1-score over supervised unimodal learning and +25.2% accuracy and +26.4% F1-score over contrastive baselines. This multimodal-to-unimodal transfer enables robust unimodal HAR in real-world scenarios with limited labels and supports deployment on edge devices.
Abstract
Human activity recognition (HAR) will be an essential function of various emerging applications. However, HAR typically encounters challenges related to modality limitations and label scarcity, leading to an application gap between current solutions and real-world requirements. In this work, we propose MESEN, a multimodal-empowered unimodal sensing framework, to utilize unlabeled multimodal data available during the HAR model design phase for unimodal HAR enhancement during the deployment phase. From a study on the impact of supervised multimodal fusion on unimodal feature extraction, MESEN is designed to feature a multi-task mechanism during the multimodal-aided pre-training stage. With the proposed mechanism integrating cross-modal feature contrastive learning and multimodal pseudo-classification aligning, MESEN exploits unlabeled multimodal data to extract effective unimodal features for each modality. Subsequently, MESEN can adapt to downstream unimodal HAR with only a few labeled samples. Extensive experiments on eight public multimodal datasets demonstrate that MESEN achieves significant performance improvements over state-of-the-art baselines in enhancing unimodal HAR by exploiting multimodal data.
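To make the idea of cross-modal feature contrastive learning more concrete, the following is a minimal sketch of a generic symmetric InfoNCE-style loss between two modalities. It is not MESEN's actual $L_{CMF}$, whose exact form is not given in this summary. The function names, the temperature value, and the assumption that features at the same batch index come from the same time window are all illustrative choices.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cross_modal_info_nce(feats_a, feats_b, temperature=0.1):
    """Generic symmetric InfoNCE loss between two modalities (illustrative).

    Assumption: feats_a[i] and feats_b[i] are features of the same time
    window from two different sensors (a positive pair); every other
    pairing in the batch is treated as a negative.
    """
    a = [l2_normalize(v) for v in feats_a]
    b = [l2_normalize(v) for v in feats_b]
    n = len(a)

    # Temperature-scaled cosine similarity between every cross-modal pair.
    sim = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def nce(rows):
        # Mean cross-entropy where the diagonal entry is the correct match.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_denom = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_denom - row[i]
        return total / len(rows)

    # Average the A-to-B and B-to-A directions.
    sim_t = [list(col) for col in zip(*sim)]
    return 0.5 * (nce(sim) + nce(sim_t))


# Toy batch: two windows, two-dimensional features per modality.
acc = [[1.0, 0.0], [0.0, 1.0]]
gyro_aligned = [[2.0, 0.0], [0.0, 3.0]]
gyro_shuffled = [[0.0, 3.0], [2.0, 0.0]]

loss_aligned = cross_modal_info_nce(acc, gyro_aligned)
loss_shuffled = cross_modal_info_nce(acc, gyro_shuffled)
```

In this toy batch, the loss is lower when the two modalities agree on which windows belong together, which is the signal a contrastive objective of this kind uses to learn from unlabeled multimodal data.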
