SonicBoom: Contact Localization Using Array of Microphones

Moonyoung Lee, Uksang Yoo, Jean Oh, Jeffrey Ichnowski, George Kantor, Oliver Kroemer

TL;DR

SonicBoom addresses contact localization during collisions in occluded, cluttered environments by embedding six contact microphones along a robot end-effector and learning a mapping from vibrotactile signals and proprioception to the contact location on a cylindrical surface. The approach fuses mel spectrograms, GCC-PHAT features, and trajectory data through a multi-modal transformer trained on a dataset of $108{,}000$ audio files from $18{,}000$ interactions. It reports an MED ranging from $0.42\,\mathrm{cm}$ for in-distribution contacts to $2.22\,\mathrm{cm}$ under out-of-distribution conditions, and demonstrates haptic mapping in mock canopies, including zero-shot transfer with audio-only inputs. The work supports the viability and generalization of acoustic sensing for tactile localization in visually challenging outdoor robotics tasks, and points to continuous tracking and multi-contact estimation as future work.

Abstract

In cluttered environments where visual sensors encounter heavy occlusion, such as in agricultural settings, tactile signals can provide crucial spatial information for the robot to locate rigid objects and maneuver around them. We introduce SonicBoom, a holistic hardware and learning pipeline that enables contact localization through an array of contact microphones. While conventional sound source localization methods effectively triangulate sources in air, localization through solid media with irregular geometry and structure presents challenges that are difficult to model analytically. We address this challenge through a feature-engineering and learning-based approach, autonomously collecting 18,000 robot interaction-sound pairs to learn a mapping between acoustic signals and collision locations on the robot end-effector link. By leveraging relative features between microphones, SonicBoom achieves localization errors of 0.42 cm for in-distribution interactions and maintains robust performance, with 2.22 cm error, even with novel objects and contact conditions. We demonstrate the system's practical utility through haptic mapping of occluded branches in mock canopy settings, showing that acoustic-based sensing can enable reliable robot navigation in visually challenging environments.
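One relative feature between microphones named in the summary is GCC-PHAT, a phase-weighted cross-correlation whose peak gives the time delay between two channels. The following is a minimal NumPy sketch of the standard technique, not the authors' implementation: the function names `gcc_phat` and `pairwise_gcc_phat`, the sample rate, and the choice to use all 15 pairs of the six microphones are assumptions made for illustration.

```python
from itertools import combinations

import numpy as np


def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay of `sig` relative to `ref` (seconds) with GCC-PHAT.

    Returns (tau, cc), where cc is the correlation over lags
    -max_shift..+max_shift. Hypothetical helper, not from the paper.
    """
    n = sig.shape[0] + ref.shape[0]          # zero-pad to avoid circular wrap
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)                   # cross-power spectrum
    R /= np.abs(R) + 1e-12                   # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)

    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(fs * max_tau), max_shift))

    # Reorder so that index 0 corresponds to lag -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(cc)) - max_shift
    return shift / fs, cc


def pairwise_gcc_phat(channels, fs, max_tau=None):
    """Delay estimate for every microphone pair (assumed: all pairs are used)."""
    delays = {}
    for i, j in combinations(range(channels.shape[0]), 2):
        delays[(i, j)], _ = gcc_phat(channels[i], channels[j], fs, max_tau)
    return delays
```

For a six-channel recording shaped `(6, n_samples)`, `pairwise_gcc_phat` returns one delay per microphone pair. The PHAT weighting discards magnitude, which is one reason the feature is commonly used when the propagation medium colors the signal.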

Paper Structure

This paper contains 22 sections, 5 equations, 10 figures.

Figures (10)

  • Figure 1: (a) Outdoor vineyard with occluded rigid branches and trellis that make automation challenging. (b) SonicBoom, attached as the robot end-effector link, enables the robot to localize contact using acoustics (red) as the arm swings into collision in an occluded area. (c) Using only acoustics to localize contact points (green), the robot can interactively map out the occluded object. (d) SonicBoom contains six contact microphones (blue) enclosed in a PVC pipe to capture vibrotactile signals. (e) Observed audio signal from the collision.
  • Figure 2: (a) SonicBoom end-effector link parametrized in cylindrical coordinates. (b) The striking motion used to create collision acoustic signals, shown with the end-effector position and velocity profiles.
  • Figure 3: Frequency analysis to de-noise the collision signal. The motor noise and ambient noise can be isolated to specific frequency regions, and filtered to obtain a clean collision signal.
  • Figure 4: System overview of SonicBoom for contact localization in two settings. The inputs used for localization are audio and robot proprioceptive data. The audio signal is pre-processed into mel spectrograms and GCC-PHAT features. Each sensing modality is encoded into a latent feature before being fused by the multi-sensory self-attention transformer encoder. The output prediction is represented in cylindrical coordinates $(z,\theta)$ along the SonicBoom surface, which can be used for haptic mapping or localization.
  • Figure 5: Evaluation sets increase in complexity as objects, striking velocity, and striking agents are varied. The training set is composed of variations of simpler single-rod sticks, while the evaluation set is composed of novel wooden rods and complex tree-like geometric structures.
  • ...and 5 more figures
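
The cylindrical output $(z,\theta)$ described in the Figure 4 caption can be turned into a 3D point in the link frame for haptic mapping. The sketch below is illustrative only: the function names, the pipe radius, the frame convention (z along the link axis, $\theta$ measured from the x axis), and the straight-line distance used as an error measure are all assumptions, not details taken from the paper.

```python
import numpy as np


def cylinder_to_cartesian(z, theta, radius):
    """Map surface coordinates (z, theta) to a 3D point in the link frame.

    Assumed convention: z runs along the link axis and theta is measured
    from the x axis; `radius` is the (hypothetical) pipe radius in meters.
    """
    z = np.asarray(z, dtype=float)
    theta = np.asarray(theta, dtype=float)
    return np.stack(
        [radius * np.cos(theta), radius * np.sin(theta), z], axis=-1
    )


def mean_contact_error(pred, target, radius):
    """Mean straight-line distance between predicted and true contacts.

    `pred` and `target` are arrays of shape (N, 2) holding (z, theta) rows.
    This is one plausible distance error, not necessarily the paper's metric.
    """
    p = cylinder_to_cartesian(pred[:, 0], pred[:, 1], radius)
    t = cylinder_to_cartesian(target[:, 0], target[:, 1], radius)
    return float(np.mean(np.linalg.norm(p - t, axis=-1)))
```

Transforming the resulting link-frame point by the end-effector pose from forward kinematics would place each contact in the world frame, which is the step a haptic map needs.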