Mean of Means: A 10-dollar Solution for Human Localization with Calibration-free and Unconstrained Camera Settings

Tianyi Zhang, Wengyu Zhang, Xulu Zhang, Jiaxin Wu, Xiao-Yong Wei, Jiannong Cao, Qing Li

TL;DR

This work tackles tag-free indoor human localization by reframing the problem as learning the means of distributions over body points rather than exact per-point correspondences. It introduces Mean of Means (MoM), an end-to-end encoder–decoder framework that uses large-scale sampling of mean estimators to ensure normality and stable learning, while a perspective-transform decoder regularizes the neural mapping. The approach achieves high accuracy (approximately 95% within $0.3\,\mathrm{m}$ and near 100% within $0.5\,\mathrm{m}$) at a low hardware cost with two $640\times480$ webcams (~$10$ USD) and demonstrates real-time performance (~$1{,}060$ samples/s). It outperforms traditional PnP+Triangulation and UWB baselines on a real indoor dataset of 34k samples, showing robustness to camera motion and noise, and enabling practical Metaverse-ready localization without calibration. The method offers a scalable, low-cost solution for accurate human localization that can be deployed with existing on-site cameras and supports interactive avatar applications.

Abstract

Accurate human localization is crucial for various applications, especially in the Metaverse era. Existing high-precision solutions rely on expensive, tag-dependent hardware, while vision-based methods offer a cheaper, tag-free alternative. However, current vision solutions based on stereo vision are limited by rigid perspective transformation principles and by error propagation in multi-stage SVD solvers. These solutions also require multiple high-resolution cameras with strict setup constraints. To address these limitations, we propose a probabilistic approach that treats all points on the human body as observations generated by a distribution centered on the body's geometric center. This greatly improves sampling, increasing the number of samples for each point of interest from hundreds to billions. By modeling the relation between the means of the distributions of world coordinates and pixel coordinates, and leveraging the Central Limit Theorem, we ensure normality and facilitate the learning process. Experimental results demonstrate human localization accuracy of 95% within a 0.3 m range and nearly 100% within a 0.5 m range, achieved at a cost of only 10 USD using two web cameras with a resolution of 640×480 pixels.
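
The sampling idea in the abstract can be illustrated with a small, hedged sketch. It is not the authors' implementation: the 17 keypoints, the skewed noise model, the subset size of 8, and the helper name `sample_mean_estimators` are all assumptions made for illustration. The sketch draws many random subsets of body points, takes the mean of each subset, and shows that these mean estimators cluster tightly around the overall body mean even when the individual points are far from Gaussian, which is the behavior the Central Limit Theorem argument relies on. It also counts how many distinct subsets exist, as one possible way a sample count could grow combinatorially.

```python
import math
import random
import statistics


def sample_mean_estimators(points, subset_size, num_samples, rng):
    """Draw random subsets of 2D points and return the mean of each subset."""
    means = []
    for _ in range(num_samples):
        subset = rng.sample(points, subset_size)  # sampling without replacement
        mx = sum(p[0] for p in subset) / subset_size
        my = sum(p[1] for p in subset) / subset_size
        means.append((mx, my))
    return means


rng = random.Random(0)

# Hypothetical pixel coordinates for 17 body keypoints, scattered around an
# assumed body centre with skewed (exponential) noise, i.e. clearly non-Gaussian.
center = (320.0, 240.0)
keypoints = [
    (center[0] + rng.expovariate(1 / 20.0) - 20.0,
     center[1] + rng.expovariate(1 / 40.0) - 40.0)
    for _ in range(17)
]

# Many mean estimators, each computed from a random subset of the keypoints.
means = sample_mean_estimators(keypoints, subset_size=8, num_samples=5000, rng=rng)

# The "mean of means" and the plain mean over all keypoints.
mean_of_means = (statistics.fmean(m[0] for m in means),
                 statistics.fmean(m[1] for m in means))
keypoint_mean = (statistics.fmean(p[0] for p in keypoints),
                 statistics.fmean(p[1] for p in keypoints))

# Number of distinct 8-point subsets available from 17 points.
num_subsets = math.comb(len(keypoints), 8)

print("mean of means:", mean_of_means)
print("keypoint mean:", keypoint_mean)
print("distinct subsets:", num_subsets)
```

In this toy setting the subset means have a much smaller spread than the raw keypoints, so a model trained on them would see better-behaved targets. How the paper actually forms its samples and pairs pixel means with world-coordinate means is not specified in this summary.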

Paper Structure (20 sections, 14 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Neural implementation of MoM with encoder-decoder collaboration.
  • Figure 2: Training and testing losses of MoM.
  • Figure 3: Distributions of different random variables in a real example: the camera captures the user standing in a fixed location and performing various actions repeatedly. The world and pixel coordinates of the keypoints on the body at different time points can then be used as the observations of variables.
  • Figure 4: MoM performance versus the number of participants involved in training, with the predicted trajectories compared with the ground truth (GT): as the number of participants increases, the model's performance improves and stabilizes. The performance appears to converge when data from four participants are used for training.
  • Figure 5: MoM performance under varying degrees of camera perturbation, with the predicted trajectories compared with the ground truth (GT): no significant drop in accuracy within the 0.3 m range was observed until the maximum camera offset exceeded 8 pixels.
  • ...and 2 more figures