Towards Robust and Realistic Human Pose Estimation via WiFi Signals

Yang Chen, Jingcai Guo, Song Guo, Jingren Zhou, Dacheng Tao

TL;DR

WiFi-based human pose estimation faces cross-domain distribution shifts and distorted skeletal topology. The authors propose DT-Pose, a two-phase framework with Domain-consistent Representation Learning using masked self-supervision and temporal contrastive learning plus uniformity regularization, and Topology-constrained Pose Decoding that fuses a task prompt with Graph Convolutional Networks and Transformers to enforce skeletal structure. The approach delivers state-of-the-art results on MM-Fi, WiPose, and Person-in-WiFi-3D for both 2D and 3D pose estimation and demonstrates strong cross-domain generalization, highlighting improved stability and realism of predicted poses. This work advances privacy-preserving, robust pose estimation in uncontrolled environments and provides a scalable pipeline for cross-domain transfer in non-visual sensing modalities.

Abstract

Robust WiFi-based human pose estimation is a challenging task that bridges discrete and subtle WiFi signals to human skeletons. This paper revisits this problem and reveals two critical yet overlooked issues: 1) a cross-domain gap, caused by significant variations between source- and target-domain pose distributions; and 2) a structural fidelity gap, in which predicted skeletal poses exhibit distorted topology, usually with misplaced joints and disproportionate bone lengths. This paper fills these gaps by reformulating the task into a novel two-phase framework dubbed DT-Pose: Domain-consistent representation learning and Topology-constrained Pose decoding. Concretely, we first propose a temporal-consistent contrastive learning strategy with uniformity regularization, coupled with self-supervised masking-reconstruction operations, to enable robust learning of domain-consistent and motion-discriminative WiFi-specific representations. Beyond this, we introduce a simple yet effective pose decoder with task prompts, which integrates Graph Convolutional Network (GCN) and Transformer layers to constrain the topology of the generated skeleton by exploring the adjacent-overarching relationships among human joints. Extensive experiments on various benchmark datasets highlight the superior performance of our method in tackling these fundamental challenges in both 2D and 3D human pose estimation tasks.
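
The abstract names a temporal-consistent contrastive objective with uniformity regularization but does not give its exact form. As a rough illustration only, the sketch below assumes an InfoNCE-style loss in which each clip's positive is the embedding of a temporally adjacent clip, and a pairwise Gaussian-potential uniformity term of the kind commonly used in contrastive learning. The function names, the temperature, and the parameter `t` are hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def uniformity_loss(embeddings, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs.

    Assumed form (not confirmed by the paper). Lower values mean the
    normalized embeddings are spread more evenly on the unit sphere;
    fully collapsed embeddings give exactly 0.
    """
    z = [l2_normalize(v) for v in embeddings]
    potentials = []
    for i in range(len(z)):
        for j in range(i + 1, len(z)):
            sq_dist = sum((a - b) ** 2 for a, b in zip(z[i], z[j]))
            potentials.append(math.exp(-t * sq_dist))
    return math.log(sum(potentials) / len(potentials))


def temporal_info_nce(anchors, positives, temperature=0.1):
    """InfoNCE over a batch (assumed form, not confirmed by the paper).

    positives[i] is the embedding of a clip temporally adjacent to
    anchors[i]; every other positives[j] acts as a negative.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)
```

In a setup like this, the two terms would typically be summed with a weighting coefficient, so that temporally adjacent clips are pulled together while the uniformity term keeps the embeddings from collapsing to a single point.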

Paper Structure (20 sections, 9 equations, 9 figures, 10 tables)

Figures (9)

  • Figure 1: (a) Distribution of pose coordinates in the source and target domains. (b) Predictions of the MetaFi++ method [zhou2023metafi++] and the corresponding ground truth. (c) Overview of our framework.
  • Figure 2: The pipeline of our method, including the pre-training and pose decoding phases.
  • Figure 3: Original WiFi CSI signals and different masking strategies on the MM-Fi dataset.
  • Figure 4: Performance on the MM-Fi (Protocol 1 - Setting 1) dataset with different masking ratios.
  • Figure 5: t-SNE visualization of WiFi representations on MM-Fi (P1-S1). (a) Without the temporal-consistent contrastive strategy in the pre-training phase. (b) With the strategy. Each color corresponds to a distinct action category.
  • ...and 4 more figures