Sixth-Sense: Self-Supervised Learning of Spatial Awareness of Humans from a Planar Lidar

Simone Arreghini, Nicholas Carlotti, Mirko Nava, Antonio Paolillo, Alessandro Giusti

TL;DR

This work tackles omnidirectional human detection and 2D pose estimation for service robots using inexpensive 1D LiDAR by adopting a self-supervised paradigm that uses an RGB-D camera as the supervision source. A 1D FCN with a $43^{\circ}$ receptive field processes a time window of LiDAR scans to predict per-ray presence, distance, and bearing (encoded as sine and cosine), and is trained with a loss that is masked to the camera's field of view. Evaluated in an environment not seen during training, the approach reaches a precision at 80% recall ($P_{80}$) of about $70.6\%$ and an orientation error near $44^{\circ}$, while enabling omnidirectional reasoning that is not possible within the camera FOV alone. The method is demonstrated on a TIAGo platform with fused LiDARs and an Azure Kinect, and the authors offer open-source tools and datasets to advance human-robot interaction in resource-constrained robotics.
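The masked loss mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `fov_mask` and `masked_mse`, the ray layout (ray `i` at angle `i * 360 / num_rays`), and the 90° FOV used in the example are all assumptions made for illustration. The idea is only that rays outside the camera's field of view carry no label, so they contribute nothing to the loss.

```python
def fov_mask(num_rays, fov_deg, center_deg=0.0):
    """Return a 0/1 list marking rays whose angle lies inside the camera FOV.

    Assumes ray i points at angle i * 360 / num_rays degrees (hypothetical layout).
    """
    mask = []
    for i in range(num_rays):
        angle = i * 360.0 / num_rays
        # Signed angular difference to the FOV center, wrapped to [-180, 180)
        diff = (angle - center_deg + 180.0) % 360.0 - 180.0
        mask.append(1.0 if abs(diff) <= fov_deg / 2.0 else 0.0)
    return mask


def masked_mse(pred, target, mask):
    """Mean squared error averaged only over rays where mask is 1."""
    total = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    count = sum(mask)
    # With no labeled rays there is nothing to penalize
    return total / count if count > 0 else 0.0


# Example: 8 rays, hypothetical 90-degree FOV centered on ray 0
mask = fov_mask(8, 90.0)
pred = [1.0, 0.0, 5.0, 5.0, 5.0, 5.0, 5.0, 0.0]
target = [0.0] * 8
loss = masked_mse(pred, target, mask)
```

In this example only rays 0, 1, and 7 fall inside the FOV, so the large errors on the remaining rays are ignored. This is what lets a network trained on a narrow labeled sector still produce predictions for every ray at inference time.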

Abstract

Localizing humans is a key prerequisite for any service robot operating in proximity to people. In these scenarios, robots rely on a multitude of state-of-the-art detectors usually designed to operate with RGB-D cameras or expensive 3D LiDARs. However, most commercially available service robots are equipped with cameras with a narrow field of view, making them blind when a user is approaching from other directions, or inexpensive 1D LiDARs whose readings are difficult to interpret. To address these limitations, we propose a self-supervised approach to detect humans and estimate their 2D pose from 1D LiDAR data, using detections from an RGB-D camera as a supervision source. Our approach aims to provide service robots with spatial awareness of nearby humans. After training on 70 minutes of data autonomously collected in two environments, our model is capable of detecting humans omnidirectionally from 1D LiDAR data in a novel environment, with 71% precision and 80% recall, while retaining an average absolute error of 13 cm in distance and 44° in orientation.
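To make the orientation figures concrete, the sketch below shows one common way to recover an angle from a predicted sine and cosine pair and to compute an absolute angular error that respects wrap-around. The helper names are hypothetical and the paper's exact evaluation code is not given in this summary, so treat this as an illustration of the general technique only.

```python
import math


def decode_bearing(sin_pred, cos_pred):
    """Recover an angle in degrees from (possibly unnormalized) sine and cosine."""
    return math.degrees(math.atan2(sin_pred, cos_pred))


def angular_error_deg(a, b):
    """Absolute difference between two angles in degrees, wrapped to [0, 180]."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def mean_absolute_angular_error(preds, targets):
    """Average wrapped absolute error over paired angle lists (degrees)."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, targets)]
    return sum(errors) / len(errors)


# 350 degrees and 10 degrees are only 20 degrees apart once wrap-around is handled
err = angular_error_deg(350.0, 10.0)
```

Predicting sine and cosine instead of the raw angle avoids the discontinuity at ±180°, and `atan2` tolerates outputs that do not lie exactly on the unit circle.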

Paper Structure

This paper contains 8 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Our approach uses a human detector from the narrow-FOV Azure Kinect as a source of labels to train a 1D FCN that, given planar LiDAR scans, predicts the presence and relative 2D pose of humans around the robot. The training approach relies only on hardware onboard the robot and can autonomously collect data in any environment. In this environment (Lab), a Motion Capture system collects ground truth used for evaluation purposes.
  • Figure 2: Walking people and static structures as perceived by the LiDAR: lighter shades of green indicate older scans in the temporal window, whereas black arrows indicate the people's instantaneous orientation.
  • Figure 3: Training data is autonomously collected by the robot in a University Corridor (top) and a Break Area (bottom).
  • Figure 4: Our model uses a temporal window of $n$ LiDAR scans to predict the presence $p$ of nearby people, their distance $d$, and relative bearing $o$ (represented by sine and cosine). Dilated circular convolutions handle omnidirectional scans and yield a $43^{\circ}$ receptive field. A masked MSE loss is only enforced on predictions that overlap with the camera FOV (red shaded area).
  • Figure 5: Left: Precision-Recall curve for detection. Right: relative bearing error distribution. Results computed against mocap ground truth in the Lab test set.
  • ...and 1 more figure
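The dilated circular convolutions named in the Figure 4 caption can be sketched as follows. The layer configuration that yields the $43^{\circ}$ receptive field is not listed in this summary, so the kernel sizes and dilations in the example are purely hypothetical, as are the function names. The sketch shows two things: indices wrap around the scan so the first and last rays are treated as neighbors, and the receptive field of stacked stride-1 layers grows as $1 + \sum (k - 1)\,d$.

```python
def circular_dilated_conv1d(signal, kernel, dilation=1):
    """1D convolution (cross-correlation) with wrap-around indexing.

    The output has the same length as the input; taps are spaced `dilation` apart.
    """
    n = len(signal)
    half = (len(kernel) - 1) // 2
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            # Modulo indexing makes the scan circular: ray -1 is the last ray
            idx = (i + (j - half) * dilation) % n
            acc += w * signal[idx]
        out.append(acc)
    return out


def receptive_field(layers):
    """Receptive field in rays for stacked stride-1 layers of (kernel_size, dilation)."""
    rf = 1
    for kernel_size, dilation in layers:
        rf += (kernel_size - 1) * dilation
    return rf


# A single impulse at ray 0 reaches rays 2 and 4 through the dilated taps,
# with ray 4 reached by wrapping past the start of the scan
response = circular_dilated_conv1d([1, 0, 0, 0, 0, 0], [1, 1, 1], dilation=2)

# Hypothetical stack: three kernel-size-3 layers with dilations 1, 2, 4
rays = receptive_field([(3, 1), (3, 2), (3, 4)])
```

Multiplying the receptive field in rays by the LiDAR's angular resolution gives the angular receptive field; the specific resolution and layer stack used by the authors would be needed to reproduce the $43^{\circ}$ value.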