Monocular Person Localization under Camera Ego-motion
Yu Zhan, Hanjing Ye, Hong Zhang
TL;DR
This work tackles monocular person localization when the host robot induces significant camera ego-motion. It introduces an optimization-based framework that represents a person as four collinear points and jointly estimates the camera attitude and the person’s 3D position by minimizing a weighted reprojection error on the normalized image plane, solved as a nonlinear least-squares problem with a robust loss. The method is integrated into a Robot Person Following (RPF) system that runs in real time on a quadruped robot, and it is evaluated on public datasets plus a new RPF-Quadruped dataset, where it improves on geometric-model and learning-based baselines under challenging ego-motion. Because the approach reduces reliance on odometry, it provides robust frame-wise localization that supports stable long-term following on rough terrain. The dataset is publicly released to facilitate further research.
Abstract
Localizing a person from a moving monocular camera is critical for Human-Robot Interaction (HRI). To estimate the 3D human position from a 2D image, existing methods either depend on the geometric assumption of a fixed camera or use a position regression model trained on datasets containing little camera ego-motion. These methods are vulnerable to severe camera ego-motion, which results in inaccurate person localization. We instead treat person localization as part of a pose estimation problem. By representing a human with a four-point model, our method jointly estimates the 2D camera attitude and the person's 3D location through optimization. Evaluations on both public datasets and real robot experiments demonstrate that our method outperforms baselines in person localization accuracy. Our method is further integrated into a person-following system and deployed on an agile quadruped robot.
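The joint estimation described above can be illustrated with a minimal sketch of a weighted reprojection cost with a Huber robust loss. This is not the authors' implementation. It rests on several assumptions that are not stated in the text: the camera attitude is parameterized as roll and pitch, the camera height above the ground is known, the person stands upright on flat ground, and the four collinear points sit at the hypothetical heights in `POINT_HEIGHTS`. The function names `project`, `huber`, and `cost` are likewise illustrative.

```python
import math

# Hypothetical heights (m) of the four collinear points above the footprint.
# These values are assumptions for illustration, not taken from the paper.
POINT_HEIGHTS = (0.0, 0.9, 1.45, 1.7)


def project(roll, pitch, foot_xy, cam_height):
    """Project the four points onto the normalized image plane (z = 1).

    World frame: x forward, y left, z up, ground at z = 0.
    Camera frame: x right, y down, z forward, located at (0, 0, cam_height).
    """
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    points = []
    for h in POINT_HEIGHTS:
        # Express the point in the level (zero-attitude) camera frame.
        x, y, z = -foot_xy[1], cam_height - h, foot_xy[0]
        # Apply pitch about the camera x-axis, then roll about the optical axis.
        y, z = cp * y - sp * z, sp * y + cp * z
        x, y = cr * x - sr * y, sr * x + cr * y
        points.append((x / z, y / z))
    return points


def huber(sq, delta):
    """Huber loss evaluated on a squared residual norm `sq`."""
    r = math.sqrt(sq)
    return 0.5 * sq if r <= delta else delta * (r - 0.5 * delta)


def cost(params, observed, weights, cam_height, delta=0.05):
    """Weighted, robustified reprojection cost over the four points.

    params = (roll, pitch, foot_x, foot_y); `observed` holds the detected
    points in normalized image coordinates; `weights` holds one weight each.
    """
    roll, pitch, fx, fy = params
    predicted = project(roll, pitch, (fx, fy), cam_height)
    total = 0.0
    for (u, v), (uo, vo), w in zip(predicted, observed, weights):
        total += huber(w * ((u - uo) ** 2 + (v - vo) ** 2), delta)
    return total
```

With noise-free observations the cost is zero at the true parameters and positive when the attitude is wrongly assumed to be level, which is the failure mode of fixed-camera geometric models. A nonlinear least-squares solver would minimize this cost over the four parameters for each frame.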
