Multi-View Active Sensing for Human-Robot Interaction via Hierarchically Connected Tree
Yuanjiong Ying, Xian Huang, Wei Dong
TL;DR
The paper tackles the challenge of safely executing human-robot interaction (HRI) under occlusion and restricted fields of view by introducing a multi-view active vision system (MCAV). It fuses multi-source RGB-D data using a hierarchically connected tree of human body keyparts and keypoints, estimates 3D keypoint positions from depth slices, and extracts keypart point clouds with occlusion-resilient masks. Registration to a cylindrical human model is performed in a hierarchical, constraint-aware manner via ICP, ensuring anatomically plausible poses. Experiments show substantial gains in keypart recognition recall and obstacle avoidance, highlighting MCAV's effectiveness in expanding perceptual reach and enhancing safety in industrial HRI.
Abstract
Comprehensive perception of human beings is a prerequisite for ensuring the safety of human-robot interaction. The prevailing visual sensing approach typically relies on a single static camera, which results in a restricted and occluded field of view. In our work, we develop an active vision system that uses multiple cameras to dynamically capture multi-source RGB-D data. An integrated human sensing strategy based on a hierarchically connected tree structure is proposed to fuse localized visual information. The tree model consists of nodes representing keypoints and edges representing keyparts, which are consistently interconnected to preserve structural constraints during multi-source fusion. Utilizing RGB-D data and HRNet, the 3D positions of keypoints are analytically estimated, and their presence is inferred through a sliding window of confidence scores. Subsequently, the point clouds of reliable keyparts are extracted by drawing occlusion-resistant masks, enabling fine registration between the data point clouds and a cylindrical model following the hierarchical order. Experimental results demonstrate that our method raises keypart recognition recall from 69.20% to 90.10% compared with a single static camera. Furthermore, by overcoming the challenges of localized and occluded perception, the method effectively improves the robotic arm's obstacle avoidance capabilities.
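The tree structure and the sliding-window presence test described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The class `KeypointNode`, the function `reliable_keyparts`, the edge list `KEYPART_EDGES`, the window length of 5 frames, and the 0.5 threshold are all hypothetical choices made for illustration. The rule that a keypart counts as reliable only when both of its endpoint keypoints are present is likewise an assumption, not something the abstract states.

```python
from collections import deque


class KeypointNode:
    """Tree node: one body keypoint plus a sliding window of confidence scores."""

    def __init__(self, name, window=5, threshold=0.5):
        # window and threshold are assumed values, not taken from the paper.
        self.name = name
        self.threshold = threshold
        self.scores = deque(maxlen=window)  # oldest score is dropped automatically

    def update(self, confidence):
        """Record the detector confidence for the newest frame."""
        self.scores.append(confidence)

    def is_present(self):
        """Assumed rule: present once the window is full and its mean clears the threshold."""
        if len(self.scores) < self.scores.maxlen:
            return False
        return sum(self.scores) / len(self.scores) >= self.threshold


# Tree edges (keyparts) as (parent keypoint, child keypoint) pairs.
# Hypothetical subset covering one arm only.
KEYPART_EDGES = [
    ("neck", "right_shoulder"),
    ("right_shoulder", "right_elbow"),
    ("right_elbow", "right_wrist"),
]


def reliable_keyparts(nodes, edges):
    """Return the edges whose two endpoint keypoints both pass the presence test."""
    return [(a, b) for a, b in edges if nodes[a].is_present() and nodes[b].is_present()]
```

In this sketch, a keypoint that is detected with low confidence over the whole window (for example an occluded wrist) removes only the keyparts attached to it, while the rest of the tree remains available for fusion and registration.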
