Leveraging Foundation Models for Zero-Shot IoT Sensing

Dinghao Xue, Xiaoran Fan, Tao Chen, Guohao Lan, Qun Song

TL;DR

This work investigates zero-shot IoT sensing by aligning IoT embeddings with semantic embeddings produced by a vision-language foundation model. It introduces cross-attention-based fusion of a learnable soft prompt and an auxiliary hard prompt to create robust class prototypes, and employs a supervised contrastive objective to align IoT features with these prototypes. To mitigate bias toward seen classes, it uses GAN-based data augmentation for unseen classes and a two-stage open-set detection plus cloud-based zero-shot classification framework. Evaluation on IMU, mmWave, and Wi‑Fi datasets demonstrates improved open-set detection and generalized zero-shot learning over strong baselines, highlighting the potential of FM-driven ZSL in IoT sensing with edge-cloud cooperation.
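The alignment step described above can be illustrated with a prototype-based contrastive loss. The following is a minimal sketch, not the paper's implementation: it assumes cosine similarity, a softmax over class prototypes, and a temperature of 0.1, and the function names `cosine` and `alignment_loss` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(iot_embedding, prototypes, label, temperature=0.1):
    """Cross-entropy over temperature-scaled similarities to class prototypes.

    Assumption: one prototype per class; `label` indexes the true class.
    The loss is small when the IoT embedding is closest to its own prototype.
    """
    logits = [cosine(iot_embedding, p) / temperature for p in prototypes]
    # Numerically stable log-sum-exp.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]


# Toy example: two class prototypes in a 2-D embedding space.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
embedding = [0.9, 0.1]
loss_correct = alignment_loss(embedding, prototypes, label=0)
loss_wrong = alignment_loss(embedding, prototypes, label=1)
```

Minimizing this loss over labeled training samples pulls each IoT embedding toward the prototype of its class and away from the others, which is the behavior the TL;DR attributes to the contrastive objective.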

Abstract

Deep learning models are increasingly deployed on edge Internet of Things (IoT) devices. However, these models are typically trained under supervision and fail to recognize classes that were not seen during training. To address this, zero-shot learning (ZSL) aims to classify data of unseen classes with the help of semantic information. Foundation models (FMs) trained on web-scale data have shown impressive ZSL capability in natural language processing and visual understanding. However, leveraging FMs' generalized knowledge for zero-shot IoT sensing using signals such as mmWave, IMU, and Wi-Fi has not been fully investigated. In this work, we align the IoT data embeddings with the semantic embeddings generated by an FM's text encoder for zero-shot IoT sensing. To exploit the physics principles governing the generation of IoT sensor signals and derive more effective prompts for semantic embedding extraction, we propose to use cross-attention to combine a learnable soft prompt that is optimized automatically on training data and an auxiliary hard prompt that encodes domain knowledge of the IoT sensing task. To address the problem of IoT embeddings being biased toward seen classes due to the lack of unseen class data during training, we propose using data augmentation to synthesize unseen class IoT data for fine-tuning the IoT feature extractor and embedding projector. We evaluate our approach on multiple IoT sensing tasks. Results show that our approach achieves superior open-set detection and generalized zero-shot learning performance compared with various baselines. Our code is available at https://github.com/schrodingho/FM_ZSL_IoT.
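To make the prompt-fusion idea concrete, here is a minimal sketch of single-head scaled dot-product cross-attention. It rests on several assumptions that the abstract does not state: the soft prompt tokens act as queries, the hard prompt tokens act as keys and values, and no learned projection matrices are used. The function names are hypothetical, and the authors' repository should be consulted for the actual design.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(soft_tokens, hard_tokens):
    """Fuse soft prompt tokens (queries) with hard prompt tokens (keys/values).

    Assumption: plain dot-product attention with no learned projections.
    Each output token is a weighted average of the hard prompt tokens.
    """
    dim = len(hard_tokens[0])
    fused = []
    for query in soft_tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in hard_tokens
        ]
        weights = softmax(scores)
        fused.append([
            sum(w * value[j] for w, value in zip(weights, hard_tokens))
            for j in range(dim)
        ])
    return fused


# Toy example: 2 soft prompt tokens attend over 3 hard prompt tokens.
soft = [[1.0, 0.0], [0.0, 1.0]]
hard = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
fused = cross_attention(soft, hard)
```

In a full pipeline the fused tokens would be passed through the FM's text encoder to obtain class prototypes; that step is omitted here.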

Paper Structure (22 sections, 4 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Approach overview. In §4.1, we use cross-attention to combine the soft and hard prompts to generate class prototypes. In §4.2, we use a feature extractor followed by an embedding projector to generate IoT embeddings. During model training in §4.3, we use supervised contrastive learning to align the class prototypes and IoT embeddings. We then use data augmentation to synthesize unseen class data for fine-tuning the IoT feature extractor and embedding projector. During zero-shot classification in §4.4, we first extract the IoT embeddings of the input data for open-set detection. Samples detected as belonging to seen classes are then classified by the specialist model on edge devices, while samples detected as unseen are uploaded to the cloud for zero-shot classification.
  • Figure 2: Visualization of two data samples from an IMU activity recognition dataset [zhang2012usc]. The X, Y, and Z axes are aligned with gravity, the walking direction, and the direction perpendicular to walking, respectively. The sample of class "walking forward" has near-zero values on the Y-axis of the accelerometer reading, indicating a constant speed along the walking direction. The sample of class "jumping up" has large positive values on the X-axis of the accelerometer reading, indicating upward vertical movement. The generated descriptive text characterizes these patterns.
  • Figure 3: Soft learnable prompt optimization for CLIP [zhou2022learning]. The learnable context is composed of continuous vectors $\mathbf{l}_i$ that are optimized during learning. $[\mathrm{CLASS}]$ is the tokenized class label embedding. The parameters of $[\mathrm{CLASS}]$ and the text encoder are frozen during training.
  • Figure 4: t-SNE visualization of PAMAP2 [reiss2012introducing] test data. Classes with the (S) suffix are seen classes and classes with the (U) suffix are unseen classes.
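
The two-stage inference flow in the Figure 1 caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's method: open-set detection is approximated by thresholding the maximum cosine similarity to seen-class prototypes, the edge specialist model is replaced by a nearest-prototype stand-in, and the threshold value, class names, and function `route_sample` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_class(embedding, prototypes):
    """Return (class name, similarity) of the most similar prototype."""
    scores = {name: cosine(embedding, p) for name, p in prototypes.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


def route_sample(embedding, seen_prototypes, unseen_prototypes, threshold=0.8):
    """Stage 1: open-set detection on the edge. Stage 2: route the sample.

    Assumption: a sample is treated as 'seen' when its best similarity to
    any seen-class prototype reaches `threshold`; otherwise it is sent to
    the cloud and matched against unseen-class prototypes.
    """
    seen_name, seen_score = nearest_class(embedding, seen_prototypes)
    if seen_score >= threshold:
        # Stand-in for the specialist classifier running on the edge device.
        return ("edge", seen_name)
    unseen_name, _ = nearest_class(embedding, unseen_prototypes)
    return ("cloud", unseen_name)


# Toy prototypes in a 3-D embedding space (illustrative values only).
seen = {"walking": [1.0, 0.0, 0.0], "sitting": [0.0, 1.0, 0.0]}
unseen = {"jumping": [0.0, 0.0, 1.0]}

decision_a = route_sample([0.9, 0.1, 0.0], seen, unseen)
decision_b = route_sample([0.1, 0.1, 0.9], seen, unseen)
```

With these toy values, the first sample stays on the edge and the second is sent to the cloud, mirroring the split between specialist classification and zero-shot classification in the caption.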