Towards Robust and Realistic Human Pose Estimation via WiFi Signals
Yang Chen, Jingcai Guo, Song Guo, Jingren Zhou, Dacheng Tao
TL;DR
WiFi-based human pose estimation faces cross-domain distribution shifts and distorted skeletal topology. The authors propose DT-Pose, a two-phase framework with Domain-consistent Representation Learning using masked self-supervision and temporal contrastive learning plus uniformity regularization, and Topology-constrained Pose Decoding that fuses a task prompt with Graph Convolutional Networks and Transformers to enforce skeletal structure. The approach delivers state-of-the-art results on MM-Fi, WiPose, and Person-in-WiFi-3D for both 2D and 3D pose estimation and demonstrates strong cross-domain generalization, highlighting improved stability and realism of predicted poses. This work advances privacy-preserving, robust pose estimation in uncontrolled environments and provides a scalable pipeline for cross-domain transfer in non-visual sensing modalities.
Abstract
Robust WiFi-based human pose estimation is a challenging task that maps discrete and subtle WiFi signals to human skeletons. This paper revisits the problem and reveals two critical yet overlooked issues: 1) a cross-domain gap, i.e., significant variations between the pose distributions of the source and target domains; and 2) a structural fidelity gap, i.e., predicted skeletal poses exhibit distorted topology, usually with misplaced joints and disproportionate bone lengths. This paper fills these gaps by reformulating the task into a novel two-phase framework dubbed DT-Pose: Domain-consistent representation learning and Topology-constrained Pose decoding. Concretely, we first propose a temporal-consistent contrastive learning strategy with uniformity regularization, coupled with self-supervised masking-reconstruction operations, to enable robust learning of domain-consistent and motion-discriminative WiFi-specific representations. Beyond this, we introduce a simple yet effective pose decoder with task prompts, which integrates Graph Convolutional Network (GCN) and Transformer layers to constrain the topology of the generated skeleton by exploring the adjacent-overarching relationships among human joints. Extensive experiments on various benchmark datasets highlight the superior performance of our method in tackling these fundamental challenges in both 2D and 3D human pose estimation tasks.
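To make the representation-learning objective more concrete, the following is a minimal NumPy sketch of a temporal contrastive loss combined with a uniformity term. It is not the authors' implementation: the abstract does not specify the loss forms, so the sketch assumes a standard InfoNCE loss over temporally paired embeddings and the common log-mean-exp uniformity loss on the unit hypersphere. The function names, the temperature of 0.1, and the scale t = 2.0 are illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Project each row onto the unit hypersphere."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def temporal_info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss (assumed form): row i of `anchors` should match row i of
    `positives` (e.g. the embedding of a temporally adjacent WiFi frame);
    all other rows in the batch act as negatives."""
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = a @ p.T / temperature                      # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def uniformity_loss(embeddings, t=2.0):
    """Uniformity term (assumed form): log of the mean Gaussian potential over
    all distinct pairs. It is 0 when all embeddings collapse to one point and
    becomes more negative as they spread over the hypersphere."""
    z = l2_normalize(embeddings)
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(len(z), k=1)                # distinct pairs only
    return float(np.log(np.mean(np.exp(-t * sq_dists[upper]))))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames_t = rng.normal(size=(8, 16))                 # embeddings at time t
    frames_t1 = frames_t + 0.01 * rng.normal(size=(8, 16))  # time t + 1
    total = temporal_info_nce(frames_t, frames_t1) + 0.5 * uniformity_loss(frames_t)
    print(f"combined loss: {total:.4f}")
```

In this sketch the contrastive term pulls temporally adjacent embeddings together, while the uniformity term discourages the collapse of all embeddings to a single point; the 0.5 weighting between the two is arbitrary and would need tuning in practice.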
