SimpleDepthPose: Fast and Reliable Human Pose Estimation with RGBD-Images

Daniel Bermuth, Alexander Poeppel, Wolfgang Reif

TL;DR

The paper presents SimpleDepthPose, a fast, training-free RGBD-based approach for multi-view, multi-person 3D pose estimation. It predicts 2D joints from RGB frames, computes per-joint depths from aligned depth images using a cross-shaped neighborhood with per-joint offsets, and transforms joints to world coordinates. 3D pose proposals are tracked across frames and merged by filtering outliers and averaging the top-k proposals, enabling robust multi-view fusion without neural refinement. Evaluations on MVOR and Panoptic show strong generalization and high detection rates with impressive speed, highlighting depth data as a key enabler for robustness in occluded scenes. The method’s simplicity and speed, along with public code, make it a practical option for real-time, multi-person 3D pose estimation where depth information is available.
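The depth lookup and back-projection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the arm length `arm`, the one-pixel-wide cross, the treatment of zero as a missing reading, and the function names `cross_depth` and `backproject` are all assumptions made for illustration. The final transform from camera to world coordinates with the camera extrinsics is omitted.

```python
from statistics import median


def cross_depth(depth, u, v, arm=2, offset=0.0):
    """Median depth over a cross-shaped neighbourhood centred on pixel (u, v).

    depth  : 2D list of depth values in metres, indexed [row][col];
             0 is treated as "no reading" (assumption)
    arm    : hypothetical arm length of the cross in pixels
    offset : hypothetical per-joint offset in metres, shifting the value
             from the visible body surface toward the joint centre
    """
    h, w = len(depth), len(depth[0])
    # Horizontal arm plus vertical arm; the centre pixel is counted once.
    pixels = [(v, u + d) for d in range(-arm, arm + 1)]
    pixels += [(v + d, u) for d in range(-arm, arm + 1) if d != 0]
    values = [depth[r][c] for r, c in pixels
              if 0 <= r < h and 0 <= c < w and depth[r][c] > 0]
    if not values:
        return None  # no valid depth reading around this joint
    return median(values) + offset


def backproject(u, v, z, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at depth z to camera coordinates."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)
```

Taking the median rather than the mean keeps a single stray background or foreground pixel inside the cross from shifting the joint depth.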

Abstract

In the rapidly advancing domain of computer vision, accurately estimating the poses of multiple individuals from various viewpoints remains a significant challenge, especially when reliability is a key requirement. This paper introduces a novel algorithm for multi-view, multi-person pose estimation that incorporates depth information. An extensive evaluation demonstrates that the proposed algorithm generalizes well to unseen datasets, runs fast, and adapts to different keypoint sets. To support further research, all of the work is publicly accessible.

Paper Structure

This paper contains 7 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Example of a multi-person pose estimation from multiple camera views (from the Panoptic dataset [joo2015panoptic]). Using the 2D pose detections from the color images (top), a distance to the cameras is extracted from the aligned depth images (center), and the resulting 3D poses of each view are filtered and merged into a final result (bottom).
  • Figure 2: Visualization of the cross-shape used to extract the depth value for each joint in a zoom-in of the depth image. All pixels inside the cross are used to calculate the median depth distance.
  • Figure 3: Example of the proposals for each view with some joint errors (top), a zoom-in on the per-view poses (center), and their fused result with the final joints (bottom).
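
The fusion step illustrated in Figure 3 can be sketched for a single joint as follows. This is only one plausible reading of "filtering outliers and averaging the top-k proposals": the coordinate-wise median as reference point, the distance threshold `max_dist`, ranking by 2D detection score, and the function name `fuse_joint` are assumptions, not details confirmed by the summary.

```python
from statistics import median


def fuse_joint(proposals, max_dist=0.3, k=2):
    """Fuse per-view 3D proposals for one joint into a single position.

    proposals : list of (x, y, z, score) tuples in world coordinates,
                one per camera view
    max_dist  : hypothetical outlier threshold in metres
    k         : hypothetical number of best-scoring proposals to average
    """
    if not proposals:
        return None
    # Coordinate-wise median serves as a robust reference point.
    center = [median(p[i] for p in proposals) for i in range(3)]

    def dist(p):
        return sum((p[i] - center[i]) ** 2 for i in range(3)) ** 0.5

    # Drop proposals that lie too far from the reference point.
    inliers = [p for p in proposals if dist(p) <= max_dist]
    if not inliers:
        return None
    # Average the k highest-scoring remaining proposals.
    best = sorted(inliers, key=lambda p: p[3], reverse=True)[:k]
    return tuple(sum(p[i] for p in best) / len(best) for i in range(3))
```

Because this step needs only medians, sorting, and averaging, it stays training-free and cheap per joint.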