RAPTR: Radar-based 3D Pose Estimation using Transformer
Sorachi Kato, Ryoma Yataka, Pu Perry Wang, Pedro Miraldo, Takuya Fujihashi, Petros Boufounos
TL;DR
RAPTR tackles radar-based indoor 3D human pose estimation under weak supervision, using only coarse 3D bounding-box (BBox) labels and 2D keypoint labels. It introduces a two-stage Transformer decoder with pseudo-3D deformable attention that fuses multi-view radar features and mitigates depth ambiguities. Training is guided by a structured loss: a 3D Template Loss for the first stage, and a combination of 3D Gravity and 2D Keypoint losses for the second. The method achieves state-of-the-art results on HIBER and MMVR, cutting joint position error (MPJPE) by 34.3% on HIBER and 76.9% on MMVR and also reducing center distance, while relying on cheaper supervisory signals. The approach demonstrates scalable view association and preserves plausible 3D body structures without dense 3D keypoint annotations, which makes it practical for privacy-preserving indoor radar perception.
Abstract
Radar-based indoor 3D human pose estimation has typically relied on fine-grained 3D keypoint labels, which are costly to obtain, especially in complex indoor settings involving clutter, occlusions, or multiple people. In this paper, we propose \textbf{RAPTR} (RAdar Pose esTimation using tRansformer), which is trained under weak supervision using only 3D BBox and 2D keypoint labels; both are considerably easier and more scalable to collect. RAPTR is characterized by a two-stage pose decoder architecture with pseudo-3D deformable attention that enhances pose and joint queries with multi-view radar features. First, a pose decoder estimates initial 3D poses, trained with a 3D template loss designed to exploit the 3D BBox labels and mitigate depth ambiguities. Second, a joint decoder refines the initial poses using 2D keypoint labels and a 3D gravity loss. Evaluated on two indoor radar datasets, RAPTR outperforms existing methods, reducing joint position error by $34.3\%$ on HIBER and $76.9\%$ on MMVR. Our implementation is available at https://github.com/merlresearch/radar-pose-transformer.
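To make the reported metric concrete, the following is a minimal sketch of how a mean per-joint position error and a relative error reduction are commonly computed. It is not taken from the RAPTR implementation: the function names `mpjpe` and `relative_reduction` are hypothetical, and the coordinates and error values in the usage lines are made-up illustrative numbers, not results from HIBER or MMVR.

```python
import math


def mpjpe(pred, target):
    """Mean Euclidean distance between corresponding 3D joints.

    pred, target: equal-length sequences of (x, y, z) joint positions.
    Hypothetical helper for illustration only.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equal in length")
    distances = [math.dist(p, t) for p, t in zip(pred, target)]
    return sum(distances) / len(distances)


def relative_reduction(baseline_error, new_error):
    """Percentage by which new_error is lower than baseline_error."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Illustrative values only (not from the paper).
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
error = mpjpe(pred, target)               # per-joint distances are 3 and 4
gain = relative_reduction(10.0, 5.0)      # halving the error
```

Under this definition, a reduction of $34.3\%$ means the new error is $65.7\%$ of the baseline error on the same dataset.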
