WiFi-based Human Fall and Activity Recognition using Transformer-based Encoder-Decoder and Graph Neural Networks

Younggeol Cho, Elisa Motta, Olivia Nocentini, Marta Lagomarsino, Andrea Merello, Marco Crepaldi, Arash Ajoudani

TL;DR

This work tackles privacy concerns in fall detection by leveraging WiFi CSI to estimate human skeletons and perform action recognition without cameras. It introduces TED-Net, a Transformer-augmented encoder-decoder that derives 17 2D keypoints from CSI across three antennas, and a Directed Graph Neural Network (DGNN) that classifies actions using CSI-derived skeletons with frame-level granularity. Across two datasets—MM-Fi and a custom fall-focused collection—the authors demonstrate that TED-Net outperforms existing CSI-based pose estimators and that DGNN achieves near-RGB performance for fall detection, validating the privacy-preserving viability of WiFi-based sensing. The results highlight the practical potential for home-based, vision-free monitoring of elderly individuals, while acknowledging limitations in spatial resolution and environmental sensitivity that invite future work on 3D pose and multi-person scenarios.

Abstract

Human pose estimation and action recognition have received attention due to their critical roles in healthcare monitoring, rehabilitation, and assistive technologies. In this study, we propose a novel architecture, the Transformer-based Encoder-Decoder Network (TED-Net), for estimating human skeleton poses from WiFi Channel State Information (CSI). TED-Net integrates convolutional encoders with Transformer-based attention mechanisms to capture spatiotemporal features from CSI signals. The estimated skeleton poses are used as input to a customized Directed Graph Neural Network (DGNN) for action recognition. We validated our model on two datasets: a publicly available multi-modal dataset for assessing general pose estimation, and a newly collected dataset focused on fall-related scenarios involving 20 participants. Experimental results demonstrated that TED-Net outperformed existing approaches in pose estimation, and that the DGNN achieved reliable action classification using CSI-based skeletons, with performance comparable to RGB-based systems. Notably, TED-Net maintained robust performance across both fall and non-fall cases. These findings highlight the potential of CSI-driven human skeleton estimation for effective action recognition, particularly in home applications such as elderly fall detection. In such settings, WiFi signals are often readily available, offering a privacy-preserving alternative to vision-based methods, which may raise concerns about continuous camera monitoring.

Paper Structure

This paper contains 15 sections, 1 equation, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Overview of the CSI-driven Human Skeleton Pose Estimation and Action Recognition. The network consists of two main parts: (Part A) CSI-based human skeleton pose estimation (TED-Net) and (Part B) action recognition. CSI signals from three antennas pass through CNN encoders and a Transformer to estimate skeleton keypoints. Keypoints estimated by YOLO from RGB images supervise the training. The estimated poses feed into a DGNN for action recognition and fall detection. C: Concatenate, R: Reshape.
  • Figure 2: Experimental setup. The CSI signal was emitted from one transmitter and captured by three receivers. A camera was placed at the same location as the receivers. The subject performed actions on the mattress.
  • Figure 3: CSI-driven human skeleton pose estimation results based on TED-Net. This example illustrates two subjects performing the five actions, demonstrating accurate estimations for non-fall cases. Conversely, in fall scenarios, estimation errors are noticeable, particularly at the distal parts of the upper body.
  • Figure 4: Confusion matrices showing recognition results based on (a) RGB image-based and (b) CSI-driven skeletons. 0: stand, 1: walk, 2: squat, 3: fall.
  • Figure 5: Normalized Tracking Error Across Body Segments. The position of each body segment was calculated as the average of the keypoints that comprise that segment.
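
The segment averaging described in the Figure 5 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the grouping of keypoints into segments (`SEGMENTS`), the COCO-style ordering of the 17 keypoints, the function names, and the `scale` normalization factor are all assumptions introduced here, since the summary does not specify how segments are defined or how the error is normalized.

```python
from math import hypot

# Hypothetical grouping of 17 keypoints into body segments, assuming
# COCO-style ordering (0 nose, 1-4 eyes/ears, 5-6 shoulders, 7-8 elbows,
# 9-10 wrists, 11-12 hips, 13-14 knees, 15-16 ankles).
SEGMENTS = {
    "head": [0, 1, 2, 3, 4],
    "torso": [5, 6, 11, 12],
    "left_arm": [5, 7, 9],
    "right_arm": [6, 8, 10],
    "left_leg": [11, 13, 15],
    "right_leg": [12, 14, 16],
}


def segment_position(keypoints, indices):
    """Average the (x, y) coordinates of the keypoints in one segment."""
    xs = [keypoints[i][0] for i in indices]
    ys = [keypoints[i][1] for i in indices]
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def segment_errors(predicted, reference, scale=1.0):
    """Euclidean distance between predicted and reference segment positions.

    `scale` is a placeholder normalization factor (an assumption); the
    result is the raw distance divided by it.
    """
    errors = {}
    for name, indices in SEGMENTS.items():
        px, py = segment_position(predicted, indices)
        rx, ry = segment_position(reference, indices)
        errors[name] = hypot(px - rx, py - ry) / scale
    return errors


# Toy data: every predicted keypoint is offset by (3, 4) from the reference,
# so each segment position is offset by the same amount.
reference = [(float(i), float(2 * i)) for i in range(17)]
predicted = [(x + 3.0, y + 4.0) for x, y in reference]
errors = segment_errors(predicted, reference, scale=10.0)
```

Averaging keypoints before measuring distance yields one error value per body segment, which is what makes a per-segment comparison such as the one in Figure 5 possible.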