VoxelKeypointFusion: Generalizable Multi-View Multi-Person Pose Estimation
Daniel Bermuth, Alexander Poeppel, Wolfgang Reif
TL;DR
This work tackles the generalization challenge in multi-view, multi-person pose estimation by showing that many learned models fail to transfer to unseen datasets. It introduces VoxelKeypointFusion, a learning-free, bottom-up voxel fusion method with optional depth masking, and demonstrates strong cross-dataset generalization and competitive speed, along with the first multi-view whole-body pose estimator. To support cross-dataset research, it also presents Skelda, a lightweight library for dataset handling and evaluation. Depth masking further improves robustness by reducing invalid predictions, and the whole-body extension broadens applicability to rich action and gesture analysis. The results advocate for principled, generalizable voxel-based fusion in real-world, cross-scene deployments.
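The learning-free voxel fusion idea named above can be illustrated with a generic sketch. This is not the authors' implementation: the function names (`project`, `gaussian_heatmap`, `fuse_heatmaps`), the two toy cameras, the grid extent, and the plain averaging of heatmap scores are all assumptions chosen for illustration. The sketch projects every voxel centre into each view, samples a 2D keypoint heatmap there, averages the scores across views, and takes the best-scoring voxel as the 3D keypoint estimate.

```python
import numpy as np

def project(P, pts):
    """Project Nx3 world points with a 3x4 matrix P; return Nx2 pixels and depths."""
    homo = np.hstack([pts, np.ones((len(pts), 1))])
    uvw = homo @ P.T
    depth = uvw[:, 2]
    safe = np.where(depth > 1e-9, depth, 1.0)  # avoid dividing by zero; masked later
    return uvw[:, :2] / safe[:, None], depth

def gaussian_heatmap(center, shape=(64, 64), sigma=2.0):
    """Synthetic stand-in for a 2D keypoint heatmap from any 2D pose detector."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    d2 = (xs - center[0]) ** 2 + (ys - center[1]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def fuse_heatmaps(heatmaps, projections, grid):
    """Average per-view heatmap scores at the projection of every voxel centre."""
    scores = np.zeros(len(grid))
    counts = np.zeros(len(grid))
    for hm, P in zip(heatmaps, projections):
        h, w = hm.shape
        uv, depth = project(P, grid)
        u = np.rint(uv[:, 0]).astype(int)
        v = np.rint(uv[:, 1]).astype(int)
        # Only voxels in front of the camera and inside the image contribute.
        visible = (depth > 1e-9) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        scores[visible] += hm[v[visible], u[visible]]
        counts[visible] += 1
    return scores / np.maximum(counts, 1)

# Two hypothetical calibrated cameras sharing the same intrinsics.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])        # at origin, looking along +z
R2 = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]])
t2 = np.array([[-3.0], [0.0], [3.0]])                    # at (3, 0, 3), looking along -x
P2 = K @ np.hstack([R2, t2])
projections = [P1, P2]

# Ground-truth joint used only to synthesise the per-view heatmaps.
truth = np.array([0.2, -0.1, 3.0])
heatmaps = [gaussian_heatmap(project(P, truth[None])[0][0]) for P in projections]

# Regular voxel grid with 0.1 m spacing around the scene centre.
axes = [np.linspace(-0.5, 0.5, 11), np.linspace(-0.5, 0.5, 11), np.linspace(2.5, 3.5, 11)]
grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)

fused = fuse_heatmaps(heatmaps, projections, grid)
estimate = grid[np.argmax(fused)]
print("estimated joint position:", estimate)
```

Because the fusion step contains no trained parameters, it depends only on the camera calibration and the 2D heatmaps, which is one plausible reason such an approach would transfer across datasets. A depth mask, where available, could be applied in this sketch by zeroing the scores of voxels that are inconsistent with the measured depth before taking the maximum.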
Abstract
Accurately estimating the poses of multiple people from multiple viewpoints is a challenging computer vision task, especially when the estimates must also be reliable. This work presents an extensive evaluation of how well multi-view multi-person pose estimators generalize to unseen datasets and introduces a new algorithm that performs strongly in this setting. It also studies the improvements gained by additionally using depth information. Because the new approach generalizes well not only to unseen datasets but also to different keypoints, the first multi-view multi-person whole-body estimator is presented. To support further research on these topics, all of the work is publicly accessible.
