The Un-Kidnappable Robot: Acoustic Localization of Sneaking People

Mengyu Yang, Patrick Grady, Samarth Brahmbhatt, Arun Balajee Vasudevan, Charles C. Kemp, James Hays

TL;DR

The study tackles the safety-critical problem of detecting and localizing people around robots using only the incidental sounds they produce as they move. It introduces the Robot Kidnapper dataset, a synchronized collection of 4-channel audio and 360° RGB video, and trains a multi-task, audio-only model that detects whether a moving person is present and estimates their azimuth and radial distance. Key contributions include a public, diverse dataset, an audio-only localization model that outperforms acoustic baselines, and a real-robot demonstration on a Stretch RE-1 that tracks a nearby person in real time without active sensing. The work shows that passive audio sensing is a viable source of human awareness in robotics: it offers a fallback when visual or other sensors fail and supports safer human-robot interaction in everyday environments.

Abstract

How easy is it to sneak up on a robot? We examine whether we can detect people using only the incidental sounds they produce as they move, even when they try to be quiet. We collect a robotic dataset of high-quality 4-channel audio paired with 360 degree RGB data of people moving in different indoor settings. We train models that predict if there is a moving person nearby and their location using only audio. We implement our method on a robot, allowing it to track a single person moving quietly with only passive audio sensing. For demonstration videos, see our project page: https://sites.google.com/view/unkidnappable-robot

Paper Structure (21 sections, 1 equation, 6 figures, 2 tables)


Figures (6)

  • Figure 1: Can we detect where people are based only on the subtle sounds they incidentally produce when they move, even when they try to be quiet? We collect a dataset of high-quality audio paired with 360° RGB data with different participants in multiple indoor scenes. We train models to localize a moving person based on audio only and implement it on a robot.
  • Figure 2: Frames from the Robot Kidnapper dataset (static robot). The participant wears a hat with ArUco markers [garrido2014automatic] used to calculate ground truth radial distance. The RGB frames are used to calculate the ground truth centroid of the person using DeepLabv3+ [chen2018encoder]. Only the audio is used during training. The vertical red lines are the angles predicted by our model in an unseen room. The participant is walking normally in these frames.
  • Figure 3: (a) Dataset capture setup. (b) Distribution of radial distances between the robot and person in the dataset.
  • Figure 4: Diagram of our model architecture. We perform background subtraction (see the background subtraction section) on input spectrograms before passing them through a spectrogram encoder with shared weights. The resulting features are concatenated and passed through the feature encoder based on the ASPP module [chen2018encoder]. The output is fed to 4 linear layer heads for the prediction tasks.
  • Figure 5: Log spectrograms for all categories along with regular talking. No talking is used in our work, but we show the spectrogram as reference for a common sound source used in localization. All recordings were taken in the same room during the same recording session.
  • ...and 1 more figure
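
The Figure 2 caption describes deriving a ground-truth direction from the centroid of the person's segmentation in a 360° frame. As a rough illustration only, and not the authors' pipeline, the sketch below assumes an equirectangular frame in which pixel column maps linearly to azimuth. It uses a circular mean so that a person straddling the left/right image seam still yields a sensible angle. The function name `mask_azimuth_deg`, the column-to-angle convention, and the list-of-rows mask format are all assumptions made for this example.

```python
import math


def mask_azimuth_deg(mask):
    """Estimate the azimuth, in degrees within [0, 360), of a binary mask.

    Assumption (not from the paper): `mask` is a list of rows taken from an
    equirectangular 360-degree frame, and column c out of W columns
    corresponds to azimuth 360 * (c + 0.5) / W.
    Returns None when the mask contains no foreground pixels.
    """
    width = len(mask[0])
    sum_sin = 0.0
    sum_cos = 0.0
    count = 0
    for row in mask:
        for col, value in enumerate(row):
            if value:
                # Map the pixel column to an angle on the unit circle.
                angle = 2.0 * math.pi * (col + 0.5) / width
                sum_sin += math.sin(angle)
                sum_cos += math.cos(angle)
                count += 1
    if count == 0:
        return None
    # Circular mean: average the unit vectors, then take the resulting angle.
    azimuth = math.degrees(math.atan2(sum_sin, sum_cos)) % 360.0
    # Guard against floating-point rounding up to exactly 360.0.
    return 0.0 if azimuth >= 360.0 else azimuth


# Hypothetical 1 x 8 mask with the person split across the image seam.
seam_mask = [[1, 0, 0, 0, 0, 0, 0, 1]]
print(mask_azimuth_deg(seam_mask))
```

A plain arithmetic mean of the column indices would place this seam-straddling person near the middle of the frame, roughly opposite their true direction, which is why the sketch averages unit vectors instead.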