A Robust Filter for Marker-less Multi-person Tracking in Human-Robot Interaction Scenarios

Enrico Martini, Harshil Parekh, Shaoting Peng, Nicola Bombieri, Nadia Figueroa

TL;DR

This work introduces a plug-and-play, three-node filter to refine incomplete 3D human poses from a single RGB-D camera, addressing occlusions and multi-person interactions in marker-less HRI. The spatial, temporal, and permanence stages compute confidence, track identities with a robust Hungarian assignment, and apply an OPF-like filter with dynamically adjusted noise to maintain smooth trajectories. Empirical results across four tasks show substantial improvements in MAE, STD, and ACC over baselines, along with higher perceived safety and near ground-truth end-effector stability. The approach is practical for real-time deployment and reduces robot jitter, enabling safer, more natural human-robot collaboration. Key equations include the cost blending in the temporal module $M_{i,j}=D_{i,j}+C_{i,j}+u(D_{i,j}+C_{i,j}-\delta)$ and the occlusion-aware covariance $R_{i,k}=\alpha^{c_{i,k}-\beta}$, illustrating how spatial and temporal cues are fused to stabilize multi-person 3D pose estimates.
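The two equations above can be illustrated with a minimal Python sketch. It rests on several assumptions not confirmed by this summary: $u$ is read as a unit step function (1 when its argument is positive, 0 otherwise), $D$ and $C$ are treated as generic per-pair cost terms with rows as tracked identities and columns as new detections, and $0 < \alpha < 1$ is chosen so that lower confidence yields larger measurement noise. The function names and all numeric values are hypothetical, and a brute-force search over permutations stands in for the Hungarian algorithm to keep the example dependency-free.

```python
from itertools import permutations


def unit_step(x):
    """Assumed reading of u: 1.0 when x is positive, otherwise 0.0."""
    return 1.0 if x > 0 else 0.0


def blended_cost(D, C, delta):
    """Build M[i][j] = D[i][j] + C[i][j] + u(D[i][j] + C[i][j] - delta)."""
    return [
        [d + c + unit_step(d + c - delta) for d, c in zip(d_row, c_row)]
        for d_row, c_row in zip(D, C)
    ]


def min_cost_assignment(M):
    """Return the column chosen for each row of a square cost matrix.

    Brute force over all permutations; a stand-in for the Hungarian
    algorithm that is only practical for a handful of people.
    """
    n = len(M)
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(M[i][perm[i]] for i in range(n)),
    )
    return list(best)


def measurement_noise(confidence, alpha, beta):
    """Occlusion-aware covariance term R = alpha ** (confidence - beta)."""
    return alpha ** (confidence - beta)


# Hypothetical values for two tracked people and two new detections.
D = [[0.1, 0.9], [0.8, 0.2]]
C = [[0.1, 0.3], [0.4, 0.1]]
M = blended_cost(D, C, delta=0.5)
assignment = min_cost_assignment(M)

# Hypothetical alpha and beta; with alpha < 1 a low confidence inflates R.
r_confident = measurement_noise(0.9, alpha=0.1, beta=0.5)
r_occluded = measurement_noise(0.2, alpha=0.1, beta=0.5)
```

Under these assumptions the step term adds a fixed penalty to any pairing whose combined cost exceeds $\delta$, which makes implausible matches unattractive to the assignment, while the exponential form lets the filter trust confident keypoints and lean on its prediction for likely occluded ones.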

Abstract

Pursuing natural and marker-less human-robot interaction (HRI) has been a long-standing robotics research focus, driven by the vision of seamless collaboration without physical markers. Marker-less approaches promise an improved user experience, but state-of-the-art methods struggle with the intrinsic errors of human pose estimation (HPE) and depth cameras. These errors can cause issues such as robot jittering, which can significantly reduce the trust users place in collaborative systems. To address these challenges, we propose a filtering pipeline that refines incomplete 3D human poses from an HPE backbone and a single RGB-D camera, handling occlusions that can degrade the interaction. Experimental results show that using the proposed filter leads to more consistent and noise-free motion representation, reducing unexpected robot movements and enabling smoother interaction.

Paper Structure

This paper contains 13 sections, 8 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of the proposed filter: a single camera sends the stream of a multi-person HRI scenario to a generic HPE backbone that outputs a set of 3D poses. The spatial evaluation node adjusts the initial prediction confidence based on the spatial relationship between keypoints. The temporal tracking and evaluation node assigns each person in the scene a unique label that stays consistent across frames. The permanence filter uses previous trajectories to detect occlusions in the scene. The result is a refined set of skeletons fed into a tracking target node that provides the 3D position goal for the velocity controller.
  • Figure 2: Visual representations of the four tasks.
  • Figure 3: 2D projection onto the XY plane for task T0. The ground-truth wrist position is shown in green and the compared marker-less methods in black: (a) OpenPose [cao2017realtime] without filtering, (b) OpenPose with a first-order Kalman filter, (c) OpenPose with a second-order Kalman filter, and (d) our filter.