RT-Pose: A 4D Radar Tensor-based 3D Human Pose Estimation and Localization Benchmark

Yuan-Hao Ho, Jen-Hao Cheng, Sheng Yao Kuan, Zhongyu Jiang, Wenhao Chai, Hsiang-Wei Huang, Chih-Lung Lin, Jenq-Neng Hwang

TL;DR

RT-Pose tackles privacy-sensitive 3D human pose estimation by leveraging calibrated 4D radar tensors alongside LiDAR and RGB data. It introduces the RT-Pose dataset (72k frames, 240 sequences, 6 actions) and a single-stage baseline, HRRadarPose, that learns high-resolution features directly from the 4D tensor, achieving a mean pose error of $MPJPE=9.93$ cm and a localization error of $MRPE=9.91$ cm on challenging scenes. The annotation workflow combines HRNet-based 2D poses, ZeDO-based 3D pose estimation, and LiDAR depth to produce accurate 3D skeletons, followed by manual refinement. Overall, RT-Pose indicates that raw 4D radar tensors carry richer information than radar point clouds for robust 3D HPE in complex, real-world conditions. It offers a benchmark and a strong baseline for future radar-based HPE methods, with potential impact on privacy-preserving, through-wall, and occlusion-robust applications. The dataset uses a single radar module to capture both vertical and horizontal cues, which simplifies the setup while preserving performance; the radar tensor has dimensions $64\times32\times128\times256$ along the velocity and spatial axes.

Abstract

Traditional methods for human localization and pose estimation (HPE), which mainly rely on RGB images as an input modality, confront substantial limitations in real-world applications due to privacy concerns. In contrast, radar-based HPE methods emerge as a promising alternative, characterized by distinctive attributes such as through-wall recognition and privacy preservation, which make them more conducive to practical deployments. This paper presents a Radar Tensor-based human pose (RT-Pose) dataset and an open-source benchmarking framework. The RT-Pose dataset comprises 4D radar tensors, LiDAR point clouds, and RGB images, and totals 72k frames across 240 sequences covering six actions of different complexity levels. The 4D radar tensor provides raw spatio-temporal information, differentiating it from other radar point cloud-based datasets. We develop an annotation process using RGB images and LiDAR point clouds to accurately label 3D human skeletons. In addition, we propose HRRadarPose, the first single-stage architecture that extracts the high-resolution representation of 4D radar tensors in 3D space to aid human keypoint estimation. HRRadarPose outperforms previous radar-based HPE work on the RT-Pose benchmark. The overall HRRadarPose performance on the RT-Pose dataset, as reflected in a mean per joint position error (MPJPE) of 9.91 cm, indicates the persistent challenges in achieving accurate HPE in complex real-world scenarios. RT-Pose is available at https://huggingface.co/datasets/uwipl/RT-Pose.
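
As a rough illustration of the reported metrics, the sketch below computes MPJPE as the mean Euclidean distance between predicted and ground-truth joints. It also computes a root-position error, on the assumption that MRPE means the mean error of a single root joint; this summary does not define MRPE, nor does it say whether poses are root-aligned before MPJPE is computed. The function names, the `root` index, and the toy coordinates are hypothetical and are not taken from the RT-Pose code base.

```python
import math

def mpjpe(pred, gt):
    """Mean per joint position error.

    pred, gt: list of frames; each frame is a list of (x, y, z) joints
    expressed in the same unit (e.g. cm). No root alignment is applied here.
    """
    dists = [
        math.dist(p, g)
        for pred_frame, gt_frame in zip(pred, gt)
        for p, g in zip(pred_frame, gt_frame)
    ]
    return sum(dists) / len(dists)

def mrpe(pred, gt, root=0):
    """Mean error of the root joint only (assumed meaning of MRPE)."""
    dists = [
        math.dist(pred_frame[root], gt_frame[root])
        for pred_frame, gt_frame in zip(pred, gt)
    ]
    return sum(dists) / len(dists)

# Toy data: one frame, two joints, coordinates in cm.
gt = [[(0.0, 0.0, 0.0), (0.0, 50.0, 0.0)]]
pred = [[(3.0, 4.0, 0.0), (0.0, 50.0, 0.0)]]

print(mpjpe(pred, gt))  # 2.5 (joint errors 5.0 and 0.0, averaged)
print(mrpe(pred, gt))   # 5.0 (root joint error only)
```

Both functions return a value in the unit of the input coordinates, so centimetre inputs yield errors directly comparable to figures reported in cm.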

Paper Structure (18 sections, 1 equation, 9 figures, 8 tables)

Figures (9)

  • Figure 1: Experimental hardware setup in an indoor environment for data collection.
  • Figure 2: Data distribution for the RT-Pose dataset: (a) Activities; (b) Environmental conditions; (c) Occlusion conditions.
  • Figure 3: Experimental instances across various indoor and outdoor conditions with diverse scenarios for data collection.
  • Figure 4: Radar signal processing flow. The green arrows trace radar point cloud generation, and the blue arrows trace 4D radar tensor generation.
  • Figure 5: Workflow of human localization and 3D pose ground truth annotation. The 2D pose estimates predicted by the pre-trained HRNet model are denoted as $P_{2d}$. The initial pose derived from LiDAR point clouds is denoted as $P_{init}$. Both $P_{2d}$ and $P_{init}$ are input to ZeDO, an optimization-based pipeline for 3D pose estimation.
  • ...and 4 more figures