Leveraging Foundation Models for Zero-Shot IoT Sensing
Dinghao Xue, Xiaoran Fan, Tao Chen, Guohao Lan, Qun Song
TL;DR
This work investigates zero-shot IoT sensing by aligning IoT embeddings with semantic embeddings produced by a vision-language foundation model. It uses cross-attention to fuse a learnable soft prompt with an auxiliary hard prompt, producing robust class prototypes, and employs a supervised contrastive objective to align IoT features with these prototypes. To mitigate bias toward seen classes, it uses GAN-based data augmentation for unseen classes and a two-stage framework that combines open-set detection with cloud-based zero-shot classification. Evaluation on IMU, mmWave, and Wi-Fi datasets shows improved open-set detection and generalized zero-shot learning over strong baselines, highlighting the potential of FM-driven ZSL in IoT sensing with edge-cloud cooperation.
Abstract
Deep learning models are increasingly deployed on edge Internet of Things (IoT) devices. However, these models are typically trained in a supervised manner and fail to recognize classes that were not seen during training. To address this, zero-shot learning (ZSL) aims to classify data from unseen classes with the help of semantic information. Foundation models (FMs) trained on web-scale data have shown impressive ZSL capability in natural language processing and visual understanding. However, leveraging FMs' generalized knowledge for zero-shot IoT sensing using signals such as mmWave, IMU, and Wi-Fi has not been fully investigated. In this work, we align the IoT data embeddings with the semantic embeddings generated by an FM's text encoder for zero-shot IoT sensing. To derive more effective prompts for semantic embedding extraction, we exploit the physics principles governing the generation of IoT sensor signals: we propose using cross-attention to combine a learnable soft prompt, which is optimized automatically on training data, with an auxiliary hard prompt that encodes domain knowledge of the IoT sensing task. Because no unseen-class data is available during training, the IoT embeddings tend to be biased toward seen classes; to address this, we propose using data augmentation to synthesize unseen-class IoT data for fine-tuning the IoT feature extractor and embedding projector. We evaluate our approach on multiple IoT sensing tasks. Results show that our approach achieves superior open-set detection and generalized zero-shot learning performance compared with various baselines. Our code is available at https://github.com/schrodingho/FM_ZSL_IoT.
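To make the alignment idea concrete, the following is a minimal sketch of how a projected IoT embedding could be matched against class prototypes by cosine similarity. It is an illustration under stated assumptions, not the implementation in the repository above: the function name `zero_shot_classify`, the three-dimensional toy vectors, the class names, and the temperature value 0.07 are all hypothetical. In the actual approach the prototypes would come from the FM's text encoder and the IoT embedding from the trained feature extractor and projector.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores, temperature):
    """Numerically stable softmax over temperature-scaled scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(iot_embedding, prototypes, temperature=0.07):
    """Return the best-matching class name and per-class probabilities.

    `prototypes` maps class name -> semantic embedding. Unseen classes
    can be included because only their text-derived prototype is needed.
    The temperature default is an assumption made for this sketch.
    """
    names = list(prototypes)
    sims = [cosine(iot_embedding, prototypes[name]) for name in names]
    probs = softmax(sims, temperature)
    best = max(range(len(names)), key=lambda i: probs[i])
    return names[best], dict(zip(names, probs))


# Toy prototypes standing in for text-encoder outputs (hypothetical values).
prototypes = {
    "walking": [1.0, 0.0, 0.0],  # seen class
    "running": [0.0, 1.0, 0.0],  # seen class
    "jumping": [0.0, 0.0, 1.0],  # unseen class: no IoT training data
}

# Toy projected IoT embedding (hypothetical values).
iot_embedding = [0.1, 0.2, 0.9]

label, probs = zero_shot_classify(iot_embedding, prototypes)
print(label)
```

In this toy setup the embedding lies closest to the "jumping" prototype, so that class is selected even though no IoT samples of it were used to build the prototype. This is the property that aligning IoT embeddings with semantic embeddings is meant to provide.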
