Improving Real-Time Omnidirectional 3D Multi-Person Human Pose Estimation with People Matching and Unsupervised 2D-3D Lifting

Pawel Knap, Peter Hardy, Alberto Tamajo, Hwasup Lim, Hansung Kim

TL;DR

The paper tackles real-time 3D multi-person pose estimation in a 360° scene by fusing an omnidirectional camera with mmWave radars to overcome depth/scale ambiguities and occlusion. It presents a complete pipeline that combines OpenPose for 2D keypoints, radar-based localization to a global frame, and unsupervised 2D-3D lifting via the LInKs model, with careful camera–radar calibration and an improved cross-sensor matching module. Key contributions include calibration methods, a refined image–radar matching approach, and effective occlusion handling, achieving around 7–8 fps on a laptop GPU. The results demonstrate robust, scalable performance for indoor and outdoor environments, offering an affordable real-time solution for multi-person 3D pose estimation in challenging scenes.

Abstract

Current human pose estimation systems focus on retrieving an accurate 3D global estimate of a single person. Therefore, this paper presents one of the first 3D multi-person human pose estimation systems that is able to work in real-time and is also able to handle basic forms of occlusion. First, we adjust an off-the-shelf 2D detector and an unsupervised 2D-3D lifting model for use with a 360$^\circ$ panoramic camera and mmWave radar sensors. We then introduce several contributions, including camera and radar calibrations, and the improved matching of people within the image and radar space. The system addresses both the depth and scale ambiguity problems by employing a lightweight 2D-3D pose lifting algorithm that is able to work in real-time while exhibiting accurate performance in both indoor and outdoor environments which offers both an affordable and scalable solution. Notably, our system's time complexity remains nearly constant irrespective of the number of detected individuals, achieving a frame rate of approximately 7-8 fps on a laptop with a commercial-grade GPU.

Paper Structure (10 sections, 1 equation, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Our experimental setup consists of a laptop with an RTX 3060 GPU, the Ricoh Theta V omnidirectional camera and three TI AWR1843BOOST mmWave radars.
  • Figure 2: Overview of our approach. We use the video from an omnidirectional camera to obtain 2D body keypoints in the image space. Simultaneously, we use three radar sensors to localise each person in our global 3D coordinate system. We then match the detected 2D poses to the radars' depth estimates. Next, the 2D poses are lifted into 3D, and finally their predicted 3D coordinates are transformed into our global coordinate system.
  • Figure 3: Qualitative results of our approach. The top images show the input frames to our model with poses captured by OpenPose. The bottom images show the corresponding reconstructed 3D poses in our global 3D coordinate system. All pictures are partially cropped around the top and bottom.
  • Figure 4: Localisation error in metres at various points around our setup. The errors were evaluated in each radar's $\hat{x}$ (left) and $\hat{z}$ (right) directions and are plotted in the $(\mathbf{\hat{X}},\mathbf{\hat{Z}})$ 2D global coordinate system. The red dot marks the system location.
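
The matching and global-placement steps described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the equirectangular pixel-to-azimuth mapping, the greedy nearest-angle matching, the 15-degree threshold, the axis convention (azimuth 0 along +Z, Y up) and all function names below are assumptions chosen for illustration only.

```python
import math


def pixel_to_azimuth(u, image_width):
    """Map a pixel column of an equirectangular frame to an azimuth in [-pi, pi).

    Assumes the image centre column corresponds to azimuth 0.
    """
    return (u / image_width) * 2.0 * math.pi - math.pi


def angular_difference(a, b):
    """Smallest absolute difference between two angles, in radians."""
    return abs((a - b + math.pi) % (2.0 * math.pi) - math.pi)


def match_poses_to_radar(pose_azimuths, radar_points, max_diff=math.radians(15)):
    """Greedy one-to-one matching of 2D poses to radar detections by azimuth.

    radar_points holds hypothetical (x, z) ground-plane positions in the
    global frame. Returns a dict {pose_index: radar_index}.
    """
    candidates = []
    for i, pose_az in enumerate(pose_azimuths):
        for j, (x, z) in enumerate(radar_points):
            radar_az = math.atan2(x, z)  # assumed convention: 0 along +Z
            diff = angular_difference(pose_az, radar_az)
            if diff <= max_diff:
                candidates.append((diff, i, j))

    # Accept the closest pairs first; each pose and detection is used once.
    candidates.sort()
    matches, used_radar = {}, set()
    for _, i, j in candidates:
        if i not in matches and j not in used_radar:
            matches[i] = j
            used_radar.add(j)
    return matches


def place_pose_in_global(root_relative_pose, radar_point, azimuth):
    """Rotate a root-relative 3D pose about the vertical axis and translate it.

    root_relative_pose is a list of (x, y, z) joints with the root at the
    origin, x to the viewer's right, y up and z away from the camera.
    """
    px, pz = radar_point
    cos_a, sin_a = math.cos(azimuth), math.sin(azimuth)
    return [
        (x * cos_a + z * sin_a + px, y, -x * sin_a + z * cos_a + pz)
        for (x, y, z) in root_relative_pose
    ]


if __name__ == "__main__":
    # Two people: one at the image centre, one a quarter turn to the right.
    azimuths = [pixel_to_azimuth(960, 1920), pixel_to_azimuth(1440, 1920)]
    radar = [(2.0, 0.0), (0.0, 3.0)]
    matches = match_poses_to_radar(azimuths, radar)
    for pose_idx, radar_idx in matches.items():
        pose = [(0.0, 0.0, 0.0), (0.1, 0.5, 0.0)]  # toy two-joint pose
        print(pose_idx, place_pose_in_global(pose, radar[radar_idx], azimuths[pose_idx]))
```

A greedy assignment is used here only because it is short and dependency-free; an optimal assignment (for example the Hungarian algorithm) or a matching that also uses the radar range would be a reasonable alternative under the same assumptions.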