VoxelKeypointFusion: Generalizable Multi-View Multi-Person Pose Estimation

Daniel Bermuth, Alexander Poeppel, Wolfgang Reif

TL;DR

This work tackles the generalization challenge in multi-view, multi-person pose estimation by showing that many learned models fail to transfer to unseen datasets. It introduces VoxelKeypointFusion, a learning-free, bottom-up voxel fusion method with optional depth masking, and demonstrates strong cross-dataset generalization and competitive speed, along with the first multi-view whole-body pose estimator. To support cross-dataset research, it also presents Skelda, a lightweight library for dataset handling and evaluation. Depth masking further improves robustness by reducing invalid predictions, and the whole-body extension broadens applicability to rich action and gesture analysis. The results advocate for principled, generalizable voxel-based fusion in real-world, cross-scene deployments.

Abstract

In the rapidly evolving field of computer vision, accurately estimating the poses of multiple individuals from various viewpoints is a formidable challenge, especially if the estimations should also be reliable. This work presents an extensive evaluation of how well multi-view multi-person pose estimators generalize to unseen datasets, and presents a new algorithm with strong performance in this task. It also studies the improvements gained by additionally using depth information. Because the new approach generalizes well not only to unseen datasets but also to different keypoints, the first multi-view multi-person whole-body estimator is presented. To support further research on these topics, all of the work is publicly accessible.

Paper Structure

This paper contains 12 sections, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Example of multi-person whole-body pose estimation from multiple camera views (from the panoptic dataset joo2015panoptic) with VoxelKeypointFusion. The left side shows the per-image 2D estimations; the right side shows their fused 3D poses.
  • Figure 2: Obtaining keypoints and person ids. First, the heatmaps of each view (a, schema for two cameras; the 3D skeletons are drawn only for better visualization) are projected as beams into the voxelized room (b, five real cameras, only a single hip joint visualized). Then peaks at overlapping beam positions are searched for and retrieved as keypoint proposals. The peak proposals, visualized here as pink crosses, are projected into each image (c, using the two cameras from a; for better visualization, they do not correspond to the overlaps in a or b). The person association is then gathered from the person-id images (c, color-coded and color-paired only for better visualization, so every pixel with a red color has id=$1$). The person ids at these points are extracted. In this case, one proposal has ids {$3$,$5$}, while the other two received no ids and are therefore discarded. The same is done with the proposals from other joint types (so there could be a right shoulder with ids {$3$,$5$}). All proposals with the same id-set are then collected into a person group, which might be merged with other overlapping groups (for example, if the left shoulder only got {$3$}).
  • Figure 3: Example of depth masking in VoxelKeypointFusion. The left image shows the keypoint heatmap beam projection of four persons from five camera views into a voxelized room, with one color for each joint type. This is the default input for the peak proposal calculation. The center shows the voxelized depth images, in which three persons (the fourth is walking through the entrance) and the room's wall are clearly visible. On the right side, the projection was masked with the depth voxels. The three persons in the center of the room are now clearly visible in the projection space as well, and peaks can no longer be proposed between the persons. Parts of the sphere's wall remain, but since the keypoint projections do not overlap there, no peak proposals are generated from them.
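
The person grouping described in Figure 2 and the depth masking described in Figure 3 can be illustrated with a small sketch. This is not the authors' implementation: the function names `mask_with_depth` and `group_proposals`, the sparse dictionary layout for voxels, and the rule of merging groups whenever their id-sets share any id are all assumptions made for illustration. The captions only state that groups with the same id-set are collected and "might be merged" with overlapping groups.

```python
# Illustrative sketch only; not the authors' implementation.
# All names and data layouts here are assumptions.

def mask_with_depth(beam_scores, occupied_voxels):
    """Keep only beam scores in voxels that the depth images mark as occupied.

    beam_scores: dict mapping (x, y, z) voxel index -> accumulated heatmap score.
    occupied_voxels: set of (x, y, z) voxel indices obtained from depth images.
    """
    return {v: s for v, s in beam_scores.items() if v in occupied_voxels}


def group_proposals(proposals):
    """Group keypoint proposals into persons by their person-id sets.

    proposals: iterable of (joint_name, xyz, ids), where ids is the set of
    person ids read from the per-view person-id images at the projected point.
    Assumption: groups are merged whenever their id sets share any id.
    """
    groups = []  # each group: {"ids": set of ids, "joints": {joint_name: xyz}}
    for joint, xyz, ids in proposals:
        ids = set(ids)
        if not ids:
            continue  # proposals that received no ids are discarded
        overlapping = [g for g in groups if g["ids"] & ids]
        groups = [g for g in groups if not (g["ids"] & ids)]
        merged = {"ids": ids, "joints": {}}
        for g in overlapping:
            merged["ids"] |= g["ids"]
            for name, pos in g["joints"].items():
                merged["joints"].setdefault(name, pos)
        # Simplification: keep the first proposal seen for each joint type.
        merged["joints"].setdefault(joint, xyz)
        groups.append(merged)
    return groups


# Toy data mirroring the caption: ids {3, 5}, a left shoulder with only {3},
# and a proposal without ids that gets dropped.
proposals = [
    ("hip", (1.0, 1.0, 1.0), {3, 5}),
    ("shoulder_right", (1.1, 1.0, 1.5), {3, 5}),
    ("shoulder_left", (0.9, 1.0, 1.5), {3}),
    ("hip", (2.5, 0.5, 1.0), set()),
]
persons = group_proposals(proposals)

# Two voxels with beam overlap; only the first is backed by depth data.
beams = {(4, 4, 2): 2.7, (6, 4, 2): 1.9}
masked = mask_with_depth(beams, occupied_voxels={(4, 4, 2)})
```

In this toy input, the three proposals with ids end up in a single person group, and the beam overlap in the voxel without depth support is removed before any peak search would run. A real pipeline would likely use dense arrays for the voxel grid and a more careful merge rule; the sparse dictionary is chosen here only to keep the example dependency-free.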