Virtual Fusion with Contrastive Learning for Single Sensor-based Activity Recognition

Duc-Anh Nguyen, Cuong Pham, Nhien-An Le-Khac

TL;DR

The paper addresses the cost and privacy challenges of sensor fusion in HAR by proposing Virtual Fusion, which leverages unlabeled multimodal data during training to improve single-sensor inference. It introduces a cross-modal contrastive learning framework that aligns representations across modalities while maintaining independent classifiers for each modality, and extends this with AFVF (Actual Fusion within Virtual Fusion) to allow inference from a subset of the training sensors. Empirically, Virtual Fusion and AFVF achieve state-of-the-art accuracy and F1-score on the UCI-HAR and PAMAP2 benchmark datasets, in some cases even surpassing actual fusion that uses multiple sensors at test time. The work highlights the practical impact of using unlabeled data to enhance modality-agnostic HAR systems and outlines future work on domain adaptation and sensor selection strategies.

Abstract

Various types of sensors can be used for Human Activity Recognition (HAR), and each of them has different strengths and weaknesses. Sometimes a single sensor cannot fully observe the user's motions from its perspective, which causes wrong predictions. While sensor fusion provides more information for HAR, it comes with many inherent drawbacks like user privacy and acceptance, costly set-up, operation, and maintenance. To deal with this problem, we propose Virtual Fusion - a new method that takes advantage of unlabeled data from multiple time-synchronized sensors during training, but only needs one sensor for inference. Contrastive learning is adopted to exploit the correlation among sensors. Virtual Fusion gives significantly better accuracy than training with the same single sensor, and in some cases, it even surpasses actual fusion using multiple sensors at test time. We also extend this method to a more general version called Actual Fusion within Virtual Fusion (AFVF), which uses a subset of training sensors during inference. Our method achieves state-of-the-art accuracy and F1-score on UCI-HAR and PAMAP2 benchmark datasets. Implementation is available upon request.
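
To make the idea of exploiting correlation among time-synchronized sensors more concrete, the sketch below shows a generic symmetric InfoNCE-style contrastive loss between two modalities. It is an illustration under assumptions, not the authors' implementation: the abstract does not specify the exact loss, so the function `cross_modal_info_nce`, the `temperature` value, the embedding sizes, and the sensor names `wrist` and `chest` are all hypothetical. Embeddings of the same time window from two sensors are treated as a positive pair, and all other windows in the batch serve as negatives.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def cross_modal_info_nce(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE-style loss between two modalities (illustrative only).

    z_a, z_b: arrays of shape (batch, dim). Row i of both arrays is assumed
    to come from the same time window, so (z_a[i], z_b[i]) is the positive pair.
    """
    z_a = l2_normalize(z_a)
    z_b = l2_normalize(z_b)
    logits = z_a @ z_b.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(len(z_a))
    loss_ab = -log_softmax(logits)[idx, idx].mean()    # modality a -> b
    loss_ba = -log_softmax(logits.T)[idx, idx].mean()  # modality b -> a
    return 0.5 * (loss_ab + loss_ba)


# Toy data (hypothetical): 8 time windows, 16-dimensional embeddings.
rng = np.random.default_rng(0)
z_wrist = rng.normal(size=(8, 16))
# Time-synchronized second sensor: nearly the same embedding per window.
z_chest_aligned = z_wrist + 0.01 * rng.normal(size=(8, 16))
# Broken synchronization: windows paired in reversed order.
z_chest_shuffled = z_chest_aligned[::-1]

loss_aligned = cross_modal_info_nce(z_wrist, z_chest_aligned)
loss_shuffled = cross_modal_info_nce(z_wrist, z_chest_shuffled)
print(f"aligned: {loss_aligned:.4f}  shuffled: {loss_shuffled:.4f}")
```

In this toy setting the loss is lower when the two sensors' windows are correctly paired than when the pairing is broken, which is the signal such an objective would use to pull representations of different sensors together. No labels are involved, which matches the abstract's point that the multi-sensor data can be unlabeled.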

Paper Structure

This paper contains 27 sections, 9 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Overall training process of Virtual Fusion. Dotted lines are optional, depending on label availability.
  • Figure 2: Examples of AFVF that fuses 2 out of multiple modalities. The dotted line connections are only applicable if $m \in M_{lbl}$.
  • Figure 3: Example of AFVF that fuses all modalities. Early fusion is not applicable.