Table of Contents
Fetching ...

Large Model for Small Data: Foundation Model for Cross-Modal RF Human Activity Recognition

Yuxuan Weng, Guoquan Wu, Tianyue Zheng, Yanbing Yang, Jun Luo

TL;DR

FM-Fi is introduced, an innovative cross-modal framework engineered to translate the knowledge of vision-based FMs for enhancing RF-based HAR systems and provides empirical validation of FM-Fi's generalizability across various environments.

Abstract

Radio-Frequency (RF)-based Human Activity Recognition (HAR) rises as a promising solution for applications unamenable to techniques requiring computer visions. However, the scarcity of labeled RF data due to their non-interpretable nature poses a significant obstacle. Thanks to the recent breakthrough of foundation models (FMs), extracting deep semantic insights from unlabeled visual data become viable, yet these vision-based FMs fall short when applied to small RF datasets. To bridge this gap, we introduce FM-Fi, an innovative cross-modal framework engineered to translate the knowledge of vision-based FMs for enhancing RF-based HAR systems. FM-Fi involves a novel cross-modal contrastive knowledge distillation mechanism, enabling an RF encoder to inherit the interpretative power of FMs for achieving zero-shot learning. It also employs the intrinsic capabilities of FM and RF to remove extraneous features for better alignment between the two modalities. The framework is further refined through metric-based few-shot learning techniques, aiming to boost the performance for predefined HAR tasks. Comprehensive evaluations evidently indicate that FM-Fi rivals the effectiveness of vision-based methodologies, and the evaluation results provide empirical validation of FM-Fi's generalizability across various environments.

Large Model for Small Data: Foundation Model for Cross-Modal RF Human Activity Recognition

TL;DR

FM-Fi is introduced, an innovative cross-modal framework engineered to translate the knowledge of vision-based FMs for enhancing RF-based HAR systems and provides empirical validation of FM-Fi's generalizability across various environments.

Abstract

Radio-Frequency (RF)-based Human Activity Recognition (HAR) rises as a promising solution for applications unamenable to techniques requiring computer visions. However, the scarcity of labeled RF data due to their non-interpretable nature poses a significant obstacle. Thanks to the recent breakthrough of foundation models (FMs), extracting deep semantic insights from unlabeled visual data become viable, yet these vision-based FMs fall short when applied to small RF datasets. To bridge this gap, we introduce FM-Fi, an innovative cross-modal framework engineered to translate the knowledge of vision-based FMs for enhancing RF-based HAR systems. FM-Fi involves a novel cross-modal contrastive knowledge distillation mechanism, enabling an RF encoder to inherit the interpretative power of FMs for achieving zero-shot learning. It also employs the intrinsic capabilities of FM and RF to remove extraneous features for better alignment between the two modalities. The framework is further refined through metric-based few-shot learning techniques, aiming to boost the performance for predefined HAR tasks. Comprehensive evaluations evidently indicate that FM-Fi rivals the effectiveness of vision-based methodologies, and the evaluation results provide empirical validation of FM-Fi's generalizability across various environments.

Paper Structure

This paper contains 33 sections, 6 equations, 23 figures.

Figures (23)

  • Figure 1: Overview of FM-Fi.
  • Figure 2: Performance of FM for HAR.
  • Figure 3: Conventional KD performance.
  • Figure 4: Minor background variations significantly alter the output embeddings of FM.
  • Figure 5: Overall design of FM-Fi.
  • ...and 18 more figures