RAPTR: Radar-based 3D Pose Estimation using Transformer

Sorachi Kato, Ryoma Yataka, Pu Perry Wang, Pedro Miraldo, Takuya Fujihashi, Petros Boufounos

TL;DR

RAPTR tackles radar-based indoor 3D human pose estimation under weak supervision, using only coarse 3D BBoxes and 2D keypoints as labels. It introduces a two-stage Transformer decoder with pseudo-3D deformable attention to fuse multi-view radar features and address depth ambiguities, guided by a structured loss: a 3D Template Loss for the first stage, and a combination of 3D Gravity and 2D Keypoint losses for the second. The method achieves state-of-the-art results on HIBER and MMVR, reducing joint position error by 34.3% and 76.9% respectively and also lowering center distance, while using cheaper supervisory signals. The approach scales view association across radar views and preserves plausible 3D body structure without dense 3D keypoint annotations, which makes it practical for privacy-preserving indoor radar perception.

Abstract

Radar-based indoor 3D human pose estimation has typically relied on fine-grained 3D keypoint labels, which are costly to obtain, especially in complex indoor settings involving clutter, occlusions, or multiple people. In this paper, we propose RAPTR (RAdar Pose esTimation using tRansformer) under weak supervision, using only 3D BBox and 2D keypoint labels, which are considerably easier and more scalable to collect. RAPTR is characterized by a two-stage pose decoder architecture with a pseudo-3D deformable attention that enhances (pose/joint) queries with multi-view radar features: a pose decoder estimates initial 3D poses with a 3D template loss designed to utilize the 3D BBox labels and mitigate depth ambiguities; and a joint decoder refines the initial poses with 2D keypoint labels and a 3D gravity loss. Evaluated on two indoor radar datasets, RAPTR outperforms existing methods, reducing joint position error by 34.3% on HIBER and 76.9% on MMVR. Our implementation is available at https://github.com/merlresearch/radar-pose-transformer.
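To make the reported metric concrete, here is a minimal sketch of how a joint position error and its relative reduction could be computed. It assumes the abstract's "joint position error" is the mean per-joint position error (MPJPE) named in the TL;DR, taken as the average Euclidean distance between predicted and reference joints. The function names `mpjpe` and `relative_reduction` and the numbers in the usage note are illustrative only and are not taken from the paper or its repository.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error (assumed definition).

    pred, gt: arrays of shape (..., J, 3) holding 3D joint coordinates.
    Returns the Euclidean distance averaged over all joints and poses.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def relative_reduction(baseline_err, new_err):
    """Percentage by which new_err is lower than baseline_err."""
    return 100.0 * (baseline_err - new_err) / baseline_err
```

For example, with made-up errors, a baseline at 10.0 cm and a new model at 6.57 cm would give `relative_reduction(10.0, 6.57)` of about 34.3, which is how a "34.3% reduction" is read here.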

Paper Structure

This paper contains 63 sections, 23 equations, 19 figures, 13 tables.

Figures (19)

  • Figure 1: RAPTR takes multi-view radar heatmaps as inputs and performs a novel Pseudo-3D deformable attention between (pose and joint) queries and multi-view radar features in a two-stage decoder to estimate 3D human poses in a 3D coordinate system. Rather than relying on expensive, environment-specific fine-grained 3D keypoint labels, RAPTR makes use of cheaper, more scalable labels such as coarse-grained 3D BBoxes and fine-grained 2D keypoints to train the model.
  • Figure 2: Multi-view radar heatmaps.
  • Figure 3: The RAPTR architecture consists of: 1) Cross-view Encoder that extracts multi-scale radar features; 2) Pseudo-3D Pose Decoder that enhances pose queries via a pseudo-3D deformable attention and predicts initial 3D poses; and 3) Pseudo-3D Joint Decoder that further refines joint queries and outputs final 3D poses. In terms of loss function, RAPTR leverages 3D BBox and 2D keypoint labels through coarse-grained 3D loss (gravity and template) and 2D keypoint loss.
  • Figure 4: The pseudo-3D deformable attention operates on a 3D reference point and 3D sampling offsets that are projected to different radar views for pseudo-3D attention between multi-view radar features and the query.
  • Figure 5: Visualization of 3D pose estimation by RAPTR and baseline methods on the HIBER dataset.
  • ...and 14 more figures
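
The Figure 4 caption describes a 3D reference point and 3D sampling offsets that are projected to different radar views. A rough sketch of that idea follows. It is not the authors' implementation. It assumes two views named `horizontal` and `vertical`, that projecting to a view simply keeps two of the three coordinates (the hypothetical `VIEW_AXES` mapping), that coordinates are already in feature-map pixel units, and that view outputs are averaged. In a trained model the offsets and weights would be predicted from the query; here they are plain inputs.

```python
import numpy as np

# Hypothetical mapping: which 3D axes (x=0, y=1, z=2) each radar view observes.
VIEW_AXES = {"horizontal": (0, 2), "vertical": (1, 2)}

def bilinear_sample(feat, uv):
    """Bilinearly sample feat (H, W, C) at continuous coords uv = (u, v).

    u indexes the W axis, v indexes the H axis; coords are clamped to the map.
    """
    h, w, _ = feat.shape
    u = float(np.clip(uv[0], 0, w - 1))
    v = float(np.clip(uv[1], 0, h - 1))
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1, v1 = min(u0 + 1, w - 1), min(v0 + 1, h - 1)
    du, dv = u - u0, v - v0
    top = (1 - du) * feat[v0, u0] + du * feat[v0, u1]
    bottom = (1 - du) * feat[v1, u0] + du * feat[v1, u1]
    return (1 - dv) * top + dv * bottom

def pseudo3d_sample(feats, ref_point, offsets, weights):
    """Aggregate multi-view features around one 3D reference point.

    feats:     dict view name -> (H, W, C) feature map
    ref_point: (3,) 3D reference point
    offsets:   (K, 3) 3D sampling offsets
    weights:   (K,) attention weights, assumed to sum to 1
    """
    points = np.asarray(ref_point, dtype=float)[None, :] + np.asarray(offsets, dtype=float)
    out = 0.0
    for view, axes in VIEW_AXES.items():
        for point, weight in zip(points, weights):
            # Project the 3D sampling point to this view by keeping two axes.
            out = out + weight * bilinear_sample(feats[view], point[list(axes)])
    # Average over views (an assumption; other fusion rules are possible).
    return out / len(VIEW_AXES)
```

The point of the sketch is that the same 3D sampling locations are shared across views, so the views are associated through geometry rather than through a separate matching step.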