WiFi CSI Based Temporal Activity Detection via Dual Pyramid Network

Zhendong Liu, Le Zhang, Bing Li, Yingjie Zhou, Zhenghua Chen, Ce Zhu

TL;DR

This work tackles WiFi CSI-based temporal activity detection in untrimmed, multi-activity streams by introducing DPWiT, a dual-pyramid framework that learns frequency-aware representations through a Temporal Signal Semantic Encoder (TSSE) and a Local Sensitive Response Encoder (LSRE), fused via Cross-attention Pyramid Fusion. A novel Signed Mask-Attention mechanism within the TSSE emphasizes informative regions while suppressing noise, and the LSRE captures regional fluctuations in a learning-free manner, enabling robust localization without heavy supervision. The authors provide a real-world WiFi CSI dataset with 2,114 annotated activity instances across 553 untrimmed samples and demonstrate state-of-the-art performance against strong baselines, supported by comprehensive ablations and qualitative analyses. The approach advances wireless Temporal Activity Detection by effectively combining high- and low-frequency cues, achieving accurate timing and semantic understanding suitable for privacy-preserving real-world monitoring.

Abstract

We address the challenge of WiFi-based temporal activity detection and propose an efficient Dual Pyramid Network that integrates Temporal Signal Semantic Encoders and Local Sensitive Response Encoders. The Temporal Signal Semantic Encoder splits feature learning into high and low-frequency components, using a novel Signed Mask-Attention mechanism to emphasize important areas and downplay unimportant ones, with the features fused using ContraNorm. The Local Sensitive Response Encoder captures fluctuations without learning. These feature pyramids are then combined using a new cross-attention fusion mechanism. We also introduce a dataset with over 2,114 activity segments across 553 WiFi CSI samples, each lasting around 85 seconds. Extensive experiments show our method outperforms challenging baselines.
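The abstract says only that the Local Sensitive Response Encoder captures fluctuations without learning; it does not give the operator. As a rough illustration of what a parameter-free fluctuation pyramid could look like, the sketch below uses a sliding-window standard deviation followed by stride-2 average pooling. Both choices, and the names `local_fluctuation` and `fluctuation_pyramid`, are assumptions made for illustration and are not the authors' method.

```python
import numpy as np


def local_fluctuation(x: np.ndarray, window: int = 5) -> np.ndarray:
    """Sliding-window standard deviation along the time axis.

    x has shape (T, C): T time steps, C channels (e.g. CSI subcarriers).
    Returns an array of shape (T, C). No trainable parameters are involved.
    The window statistic is an assumption, not taken from the paper.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    pad = window // 2
    # Edge padding keeps the output length equal to the input length.
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    # Resulting shape is (T, C, window): one window per time step and channel.
    windows = np.lib.stride_tricks.sliding_window_view(xp, window, axis=0)
    return windows.std(axis=-1)


def fluctuation_pyramid(x: np.ndarray, levels: int = 3, window: int = 5) -> list:
    """Build a multi-scale pyramid by repeatedly halving the time resolution.

    Stride-2 average pooling is an illustrative choice for downsampling.
    """
    feats = []
    cur = local_fluctuation(x, window)
    for _ in range(levels):
        feats.append(cur)
        t_even = cur.shape[0] - cur.shape[0] % 2  # drop a trailing odd step
        if t_even < 2:
            break
        cur = cur[:t_even].reshape(t_even // 2, 2, -1).mean(axis=1)
    return feats


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    signal = rng.normal(size=(64, 4))  # synthetic stand-in for a CSI stream
    for level, feat in enumerate(fluctuation_pyramid(signal)):
        print(level, feat.shape)
```

In a design like this, quiet stretches of the signal map to values near zero while bursts of motion produce a larger response, which is the kind of cue a localization head could use alongside learned features.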

Paper Structure

This paper contains 28 sections, 12 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: The variation of ubiquitous wireless signal caused by different human activities along the temporal axis.
  • Figure 2: Overview of our network. Given a raw signal input, we employ 3 CGR layers to project the signal and generate output features through dual pyramid temporal context modeling. Finally, the prediction head transforms the features into TAD results.
  • Figure 3: Prediction visualizations from different methods.
  • Figure 4: False positive profiling on the wireless dataset. We use DPWiT as our model and compare it with DyFADet, using their best mAP results.