CNN-based Robust Sound Source Localization with SRP-PHAT for the Extreme Edge

Jun Yin; Marian Verhelst

TL;DR

The paper tackles robust sound source localization on extreme-edge devices by optimizing both SRP-PHAT feature computation and the Cross3D CNN backbone. It introduces LC-SRP-Edge, a low-complexity SRP-PHAT variant using Whittaker-Shannon interpolation, and Cross3D-Edge, a compressed CNN with depthwise separable convolutions and reduced channel counts. Through ablations on synthetic and LOCATA real data, the approach achieves competitive accuracy with dramatically lower hardware footprints, enabling end-to-end latency of 8.59 ms/frame on a Raspberry Pi 4B and up to 116 frames per second. The work demonstrates superior efficiency-accuracy trade-offs over the baseline and some state-of-the-art methods, highlighting practical impact for edge deployments and suggesting directions for future multi-source SSL and SELDT integration.

Abstract

Robust sound source localization methods for noisy and reverberant environments increasingly exploit deep neural networks fed with various acoustic features. Yet state-of-the-art research mainly focuses on optimizing algorithmic accuracy, resulting in huge models that prevent edge-device deployment. The edge, however, calls for real-time, low-footprint acoustic reasoning in applications such as hearing aids and robot interaction. Hence, we start from Cross3D [16], a robust CNN-based model using SRP-PHAT features, to pursue an efficient yet compact model architecture for the extreme edge. For the SRP feature representation and the neural network, we propose our scalable LC-SRP-Edge and Cross3D-Edge algorithms, respectively, both optimized towards lower hardware overhead. LC-SRP-Edge halves the complexity and on-chip memory overhead of the sinc interpolation compared to the original LC-SRP [19]. Across multiple SRP resolution cases, Cross3D-Edge saves 10.32% to 73.71% of the computational complexity and 59.77% to 94.66% of the neural network weights relative to the Cross3D baseline. In terms of the accuracy-efficiency tradeoff, the most balanced version (EM) requires only 127.1 MFLOPS of computation, 3.71 MByte/s of bandwidth, and 0.821 MByte of on-chip memory in total, while remaining competitive in state-of-the-art accuracy comparisons. It achieves 8.59 ms/frame end-to-end latency on a Raspberry Pi 4B, which is 7.26x faster than the corresponding baseline.
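
To make the sinc interpolation mentioned above more concrete, the following is a minimal Python sketch of plain Whittaker-Shannon interpolation: a sequence known only at integer lags (for example, a cross-correlation sampled at the Nyquist rate) is evaluated at a fractional lag. The function names and the toy data are assumptions made for illustration and are not taken from the paper. The sketch shows only the generic interpolation formula, not the LC-SRP-Edge optimizations.

```python
import math


def sinc(x):
    """Normalized sinc, sin(pi*x) / (pi*x), with sinc(0) defined as 1."""
    if x == 0.0:
        return 1.0
    return math.sin(math.pi * x) / (math.pi * x)


def interpolate_at_lag(samples, first_lag, lag):
    """Evaluate a band-limited sequence at a (possibly fractional) lag.

    samples[i] is assumed to hold the value at integer lag first_lag + i.
    The sum is truncated to the available samples, so the result is an
    approximation of the ideal (infinite) Whittaker-Shannon sum.
    """
    return sum(
        value * sinc(lag - (first_lag + i))
        for i, value in enumerate(samples)
    )


# Toy data (hypothetical): a sinc pulse whose true peak sits at lag 0.3,
# i.e. between the integer sampling points.
true_peak = 0.3
lags = range(-50, 51)
toy_samples = [sinc(n - true_peak) for n in lags]

# Interpolated value at the true fractional peak position.
peak_value = interpolate_at_lag(toy_samples, -50, true_peak)
```

Because the sum is truncated to 101 samples, `peak_value` comes out close to, but not exactly, the ideal peak height of 1. At an integer lag, the interpolation simply returns the stored sample up to floating-point rounding.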
