Diffusion Model is a Good Pose Estimator from 3D RF-Vision

Junqiao Fan, Jianfei Yang, Yuecong Xu, Lihua Xie

TL;DR

This work tackles 3D human pose estimation from privacy-preserving mmWave RF-vision, where radar point clouds are sparse and noisy, causing miss-detections and unstable poses. It introduces mmDiff, a diffusion-based pose estimator that conditions the reverse diffusion on radar-derived cues, including Global Radar Context (GRC), Local Radar Context (LRC), Structural Limb-Length Consistency (SLC), and Temporal Motion Consistency (TMC). The method uses a two-phase learning objective: first, robust joint-feature extraction and coarse pose estimation; second, diffusion-based refinement guided by the radar-conditioned cues with a limb-length regression term. Experiments on mmBody and mm-Fi show state-of-the-art accuracy and enhanced pose stability under adverse conditions, demonstrating the practicality of diffusion-guided RF-vision HPE for privacy-preserving sensing in robotics and edge deployments.

Abstract

Human pose estimation (HPE) from Radio Frequency vision (RF-vision) performs human sensing using RF signals that penetrate obstacles without exposing private information (e.g., facial information). Recently, mmWave radar has emerged as a promising RF-vision sensor, providing radar point clouds by processing RF signals. However, mmWave radar has limited resolution and severe noise, leading to inaccurate and inconsistent human pose estimation. This work proposes mmDiff, a novel diffusion-based pose estimator tailored for noisy radar data. Our approach aims to provide reliable guidance as conditions to diffusion models. mmDiff addresses two key challenges: (1) miss-detection of parts of the human body, which is addressed by a module that isolates feature extraction from different body parts, and (2) signal inconsistency due to environmental interference, which is tackled by incorporating prior knowledge of body structure and motion. Several modules are designed to achieve these goals, and their features serve as the conditions for the subsequent diffusion model, eliminating the miss-detection and instability of HPE based on RF-vision. Extensive experiments demonstrate that mmDiff significantly outperforms existing methods, achieving state-of-the-art performance on public datasets.
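
To make the idea of a conditioned reverse diffusion more concrete, the following is a minimal sketch of generic DDPM ancestral sampling in which every denoising step receives a conditioning vector. It is not mmDiff's actual sampler or network: the linear noise schedule, the function names, and `toy_noise_predictor` (a hypothetical stand-in that simply assumes the clean pose equals the conditioning pose) are all assumptions made for illustration. In the paper, a learned model conditioned on the radar-derived features would take the place of this stand-in.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Linear noise schedule (an assumed, commonly used choice)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def toy_noise_predictor(pose_k, alpha_bar_k, condition):
    """Hypothetical stand-in for a learned network eps_theta(x_k, k, C).

    It assumes the clean pose equals the conditioning pose and returns the
    noise implied by that assumption, so no trained weights are needed.
    """
    scale = math.sqrt(alpha_bar_k)
    denom = math.sqrt(1.0 - alpha_bar_k)
    return [(x - scale * c) / denom for x, c in zip(pose_k, condition)]

def reverse_step(pose_k, k, condition, betas, alpha_bars, rng):
    """One DDPM ancestral sampling step from index k to k - 1 (k is 0-based)."""
    beta = betas[k]
    alpha = 1.0 - beta
    eps = toy_noise_predictor(pose_k, alpha_bars[k], condition)
    coef = beta / math.sqrt(1.0 - alpha_bars[k])
    mean = [(x - coef * e) / math.sqrt(alpha) for x, e in zip(pose_k, eps)]
    if k == 0:
        return mean  # no noise is added on the final step
    sigma = math.sqrt(beta)
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]

def sample_pose(condition, num_steps=50, seed=0):
    """Start from Gaussian noise and denoise under the given condition."""
    rng = random.Random(seed)
    betas = linear_betas(num_steps)
    alpha_bars, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        alpha_bars.append(running)
    pose = [rng.gauss(0.0, 1.0) for _ in condition]
    for k in reversed(range(num_steps)):
        pose = reverse_step(pose, k, condition, betas, alpha_bars, rng)
    return pose
```

Calling `sample_pose([0.1, -0.2, 0.3])` returns a list of three coordinates. Because the stand-in predictor treats the condition as the clean pose, the sampler is pulled back to that conditioning vector; a trained predictor would instead use the condition only as guidance.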

Paper Structure (30 sections, 10 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Left: challenges of mmWave point clouds (PCs). Right: the performance of the existing SOTA (P4Transformer, fan2021point) compared to ours. The GTs are black and predictions are colored. The sparsity and dispersion of the PCs cause inaccurate spine and shoulder estimates. Inconsistent PCs with occasional miss-detection further cause size variance and pose vibration. mmDiff proposes diffusion-based pose estimation with enhanced accuracy and stability.
  • Figure 2: mmDiff proposes a diffusion-based HPE model, using mmWave radar information as conditions. $k \in [0..K]$ denotes the diffusion step. Four modules are proposed to provide more reliable guidance, addressing the PCs' noise and inconsistency: GRC and LRC first extract robust global-local radar features, $C^{glo}$ and $C^{loc}$; SLC and TMC then extract consistent human structure and motion patterns, $C^{lim}$ and $C^{tem}$.
  • Figure 3: Qualitative visualization of the estimated poses on the mmBody dataset. mmDiff demonstrates higher keypoint accuracy. The GTs are black and predictions are colored.
  • Figure 4: (a) shows the pose motion stability on mm-Fi by plotting 5 consecutive frames of poses. mmDiff shows more consistent motion patterns (zoom in for details). (b) shows the motion energy levels on mmBody, where lower AKV indicates better stability.
  • Figure 5: Histogram of the limb-length distribution for a single subject. Errors within 5 cm can be treated as correct. With $C^{lim}$, limb lengths are more accurate and show less variance, as the distribution moves towards the GT and is more concentrated.
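
The limb-length check described for Figure 5 can be sketched as follows. This is only an illustration under stated assumptions: the four-joint chain in `LIMBS`, the function names, and the sample coordinates are hypothetical and do not come from mmBody or mm-Fi; only the 5 cm tolerance is taken from the caption. Coordinates are assumed to be in meters.

```python
import math

# Hypothetical 4-joint chain; real datasets define their own skeletons.
LIMBS = [(0, 1), (1, 2), (2, 3)]

def limb_lengths(joints, limbs=LIMBS):
    """Euclidean length of each limb, given 3D joint coordinates in meters."""
    return [math.dist(joints[a], joints[b]) for a, b in limbs]

def fraction_within_tolerance(pred_lengths, gt_lengths, tol_m=0.05):
    """Share of limbs whose length error is at most tol_m (5 cm by default)."""
    hits = sum(abs(p - g) <= tol_m for p, g in zip(pred_lengths, gt_lengths))
    return hits / len(gt_lengths)

# Made-up ground-truth and predicted joints for a single frame.
gt_joints = [(0.0, 0.0, 0.0), (0.3, 0.0, 0.0), (0.3, 0.25, 0.0), (0.3, 0.25, 0.2)]
pred_joints = [(0.0, 0.0, 0.0), (0.32, 0.0, 0.0), (0.32, 0.35, 0.0), (0.32, 0.35, 0.2)]

gt_len = limb_lengths(gt_joints)
pred_len = limb_lengths(pred_joints)
ratio = fraction_within_tolerance(pred_len, gt_len)
```

In this made-up frame the second limb is off by about 10 cm while the other two are within tolerance, so `ratio` is 2/3. Collecting the predicted lengths over many frames of one subject would give the kind of histogram the caption describes.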