RapidPoseTriangulation: Multi-view Multi-person Whole-body Human Pose Triangulation in a Millisecond

Daniel Bermuth, Alexander Poeppel, Wolfgang Reif

TL;DR

RapidPoseTriangulation presents a fast, learning-free approach to multi-view, multi-person whole-body pose estimation by performing pairwise 2D-to-3D triangulation of detected joints, followed by 3D-space merging and optional tracking. The method emphasizes geometric consistency and early pruning to achieve real-time performance and strong generalization across unseen datasets without heavy training. It demonstrates competitive accuracy on standard benchmarks, supports full-body outputs (including hands and face), and significantly outpaces voxel- and learning-based baselines in speed. The authors provide public source code to encourage adoption and further advances in real-time multi-view pose analysis for human-robot interaction and other applications.

Abstract

The integration of multi-view imaging and pose estimation represents a significant advance in computer vision applications, offering new possibilities for understanding human movement and interactions. This work presents a new algorithm that improves multi-view multi-person pose estimation, focusing on fast triangulation speeds and good generalization capabilities. The approach extends to whole-body pose estimation, capturing details from facial expressions to finger movements across multiple individuals and viewpoints. Adaptability to different settings is demonstrated through strong performance across unseen datasets and configurations. To support further progress in this field, all of this work is publicly accessible.

Paper Structure

This paper contains 16 sections, 2 figures, 10 tables.

Figures (2)

  • Figure 1: Example of multi-person whole-body pose estimation from multiple camera views in a volleyball game (from the EgoHumans dataset [khirodkar2023ego]). Top: the full image of one camera with the projections of all detected poses. Bottom left: a zoom-in on one player. Bottom right: her detected 3D pose.
  • Figure 2: Obtaining 3D proposals. The process starts with 2D detections for each view (a, images from chi3d [fieraru2020three]; note the small blue-colored false positive). Then, in step (1), all possible pairs between the views are created. In this case, pairing the two detections above with the three below yields six pairs. The core joints of all pairs are triangulated into 3D proposals (b, steps (3,4)). These are then reprojected into the 2D views (c, step (6)), and a distance-based error relative to the original 2D poses (visualized in black in c) is calculated. As can be seen in the image, the green and pink proposals clearly do not match their original 2D poses and receive a very high error. The yellow and light-blue poses result from the flipped (man with woman) pairs and also have a notable error. All pairs with errors above a threshold are dropped in step (8). Only the remaining red and dark-blue pairs, whose errors are low enough, are used in the further steps (9-12).
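
The pair-and-filter idea described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes calibrated 3x4 projection matrices, uses plain linear (DLT) triangulation, treats every supplied joint as a core joint, and takes the mean pixel distance as the "distance-based error". The names `triangulate_dlt`, `project`, `pair_proposals`, and `max_error` are hypothetical.

```python
import itertools
import numpy as np


def triangulate_dlt(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 2D-2D correspondence (assumed method)."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The homogeneous 3D point is the right singular vector
    # belonging to the smallest singular value of A.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]


def project(P, X):
    """Project (J, 3) points with a 3x4 matrix P, returning (J, 2) pixels."""
    Xh = np.hstack([X, np.ones((len(X), 1))])
    x = Xh @ P.T
    return x[:, :2] / x[:, 2:3]


def pair_proposals(detections, projections, max_error):
    """Build 3D proposals from all cross-view pose pairs and drop bad ones.

    detections:  one list per view, each holding (J, 2) arrays of 2D joints
    projections: one 3x4 projection matrix per view
    max_error:   hypothetical reprojection threshold in pixels
    """
    kept = []
    # Create all pairs of poses that come from two different views.
    for va, vb in itertools.combinations(range(len(detections)), 2):
        Pa, Pb = projections[va], projections[vb]
        for (ia, pa), (ib, pb) in itertools.product(
            enumerate(detections[va]), enumerate(detections[vb])
        ):
            # Triangulate every joint of the pair into a 3D proposal.
            X = np.array([triangulate_dlt(Pa, Pb, a, b) for a, b in zip(pa, pb)])
            # Reproject and measure the mean distance to the 2D poses.
            err_a = np.linalg.norm(project(Pa, X) - pa, axis=1).mean()
            err_b = np.linalg.norm(project(Pb, X) - pb, axis=1).mean()
            err = 0.5 * (err_a + err_b)
            # Keep only pairs whose error stays below the threshold.
            if err <= max_error:
                kept.append(((va, ia), (vb, ib), X, err))
    return kept
```

Calling `pair_proposals(detections, projections, max_error=5.0)` returns tuples of the form `((view_a, pose_a), (view_b, pose_b), joints_3d, error)`. Note that with only two cameras, mismatched people who happen to lie near the same epipolar plane can still produce a low error, which is one reason additional views and the later merging steps matter.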