Talk is Not Always Cheap: Promoting Wireless Sensing Models with Text Prompts

Zhenkui Yang, Zeyi Huang, Ge Wang, Han Ding, Tony Xiao Han, Fei Wang

TL;DR

This work tackles the underutilization of textual information in wireless sensing for human action recognition and localization. It introduces WiTalk, a text-guided, multimodal framework that injects semantic prompts via a lightweight textual branch without modifying sensing architectures. By leveraging hierarchical prompts and LLM-based semantic enrichment, WiTalk achieves consistent gains across HAR and TAL tasks on three public datasets, with improvements up to 13.68% in mAP on XRFV2 and average gains around 5% on WiFiTAL. The study demonstrates that text semantics offer a low-cost, scalable enhancement for wireless sensing, improving robustness and generalization while maintaining privacy advantages.

Abstract

Wireless signal-based human sensing technologies, such as WiFi, millimeter-wave (mmWave) radar, and Radio Frequency Identification (RFID), enable the detection and interpretation of human presence, posture, and activities, thereby providing critical support for applications in public security, healthcare, and smart environments. These technologies exhibit notable advantages due to their non-contact operation and environmental adaptability; however, existing systems often fail to leverage the textual information inherent in datasets. To address this, we propose a text-enhanced wireless sensing framework, WiTalk, that integrates semantic knowledge through three hierarchical prompt strategies (label-only, brief description, and detailed action description) without requiring architectural modifications or incurring additional data costs. We validate this framework on three public benchmark datasets: XRF55 for human action recognition (HAR), and WiFiTAL and XRFV2 for WiFi temporal action localization (TAL). Experimental results demonstrate significant performance improvements: on XRF55, accuracy for WiFi, RFID, and mmWave increases by 3.9%, 2.59%, and 0.46%, respectively; on WiFiTAL, the average performance of WiFiTAD improves by 4.98%; and on XRFV2, the mean average precision gains across various methods range from 4.02% to 13.68%. Our code is available at https://github.com/yangzhenkui/WiTalk.

Paper Structure

This paper contains 30 sections, 14 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: WiTalk transforms raw motion labels into hierarchical semantic representations. By leveraging large language models (LLMs), these enriched annotations enhance the performance of wireless sensing models for training and inference, enabling robust action recognition through integrated contextual semantics.
  • Figure 2: Three fundamental signal propagation and sensing mechanisms used in wireless sensing systems: (a) Human-Induced Multipath Effects: Movement of the human body causes dynamic changes in multipath propagation, leading to variations in signal paths and reflections that can be captured by sensing systems; (b) RFID System Operation Principle: The RFID reader emits a wireless signal that is reflected back by the passive tag via backscatter communication. Changes in the reflected signal encode positional or identity information; (c) FMCW Operation: Frequency-Modulated Continuous Wave (FMCW) radar transmits chirp signals with linearly varying frequencies. The time delay between transmitted and received signals induces a frequency shift, which can be used to estimate target distance and motion through range-Doppler analysis.
  • Figure 3: Overview of WiTalk. Wireless signals are processed by a Wireless Encoder, while motion labels are enhanced by an LLMs Assistant and encoded into text embeddings. These embeddings are stored in JSON format, refined via an Attention Module, and fused with wireless features in the Sensing Model to perform downstream tasks.
  • Figure 4: Visualization examples of the performance of XRFMamba and WiFiTAD on the XRFV2 dataset, where results marked with +T denote those obtained after incorporating the text modality.
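
The data flow described in the Figure 3 caption (prompts encoded to text embeddings, stored as JSON, attended over, then fused with wireless features) can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt wording, the `toy_embed` stand-in for a text encoder, the scaled dot-product attention, the residual-sum fusion, and the 4-dimensional vectors are all assumptions made for illustration.

```python
import json
import math

# Hypothetical three-level prompts for one action (wording is illustrative only).
PROMPTS = {
    "label_only": "wave",
    "brief": "a person waves a hand",
    "detailed": "a person raises one arm and moves the hand side to side",
}

def toy_embed(text, dim=4):
    """Stand-in for a real text encoder: a deterministic, unit-length toy vector."""
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch) / 1000.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys):
    """Scaled dot-product attention of one query vector over a list of key vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    context = [sum(w * key[d] for w, key in zip(weights, keys))
               for d in range(len(query))]
    return weights, context

# Embed each prompt once, store the result as JSON, and reload it.
stored = json.dumps({level: toy_embed(text) for level, text in PROMPTS.items()})
text_bank = json.loads(stored)

# Placeholder for the output of a wireless encoder (assumed values).
wireless_feature = [0.5, 0.1, 0.3, 0.2]

# Attend over the text embeddings, then fuse by residual addition (assumed fusion rule).
weights, context = attend(wireless_feature, list(text_bank.values()))
fused = [w + c for w, c in zip(wireless_feature, context)]
```

Because the text embeddings are computed once and cached, a sketch like this leaves the wireless encoder untouched, which matches the abstract's claim that no architectural modification is required. A real system would replace `toy_embed` with a pretrained text encoder and learn the attention projections.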