An Efficient GPU-based Implementation for Noise Robust Sound Source Localization
Zirui Lin, Masayuki Takigahira, Naoya Terakado, Haris Gulzar, Monikka Roslianna Busto, Takeharu Eda, Katsutoshi Itoyama, Kazuhiro Nakadai, Hideharu Amano
TL;DR
The paper addresses the CPU bottleneck of sound source localization (SSL) for robot audition with large microphone arrays by introducing a GPU-based GSVD-MUSIC implementation within the HARK platform. It demonstrates speedups large enough for real-time operation on both embedded (Jetson AGX Orin) and server-class (NVIDIA A100) hardware, enabling 60-channel SSL with headroom for follow-on ML/DL tasks. The approach preserves SSL accuracy, achieving RMSE on the order of $10^{-6}$ and perfect consistency across devices, and the paper details a scalable CUDA parallelization strategy centered on the GSVD component. The work makes SSL more practical to deploy in embedded and cloud-like environments and identifies a streaming mode as future work to further reduce latency and improve responsiveness in dynamic scenarios.
Abstract
Robot audition, encompassing Sound Source Localization (SSL), Sound Source Separation (SSS), and Automatic Speech Recognition (ASR), enables robots and smart devices to acquire auditory capabilities similar to human hearing. Despite the wide applicability of these technologies, processing multi-channel audio signals from microphone arrays in SSL involves computationally intensive matrix operations, which can hinder efficient deployment on Central Processing Units (CPUs), particularly in embedded systems with limited CPU resources. This paper introduces a GPU-based implementation of SSL for robot audition that uses Generalized Singular Value Decomposition-based Multiple Signal Classification (GSVD-MUSIC), a noise-robust algorithm, within HARK, an open-source software platform. For a 60-channel microphone array, the proposed implementation achieves significant performance improvements. On the Jetson AGX Orin, an embedded device powered by an NVIDIA GPU and ARM Cortex-A78AE v8.2 64-bit CPUs, we observe speedups of 5648.7x for the GSVD calculation and 10.7x for the SSL module. On a server configured with an NVIDIA A100 GPU and AMD EPYC 7352 CPUs, the speedups are 4245.1x for the GSVD calculation and 17.3x for the entire SSL module. These gains make real-time processing feasible for large-scale microphone arrays and leave ample capacity for subsequent real-time machine learning or deep learning tasks.
