CNN-based Robust Sound Source Localization with SRP-PHAT for the Extreme Edge

Jun Yin, Marian Verhelst

TL;DR

The paper tackles robust sound source localization on extreme-edge devices by optimizing both SRP-PHAT feature computation and the Cross3D CNN backbone. It introduces LC-SRP-Edge, a low-complexity SRP-PHAT variant using Whittaker-Shannon interpolation, and Cross3D-Edge, a compressed CNN with depthwise separable convolutions and reduced channel counts. Through ablations on synthetic and LOCATA real data, the approach achieves competitive accuracy with dramatically lower hardware footprints, enabling end-to-end latency of 8.59 ms/frame on a Raspberry Pi 4B and up to 116 frames per second. The work demonstrates superior efficiency-accuracy trade-offs over the baseline and some state-of-the-art methods, highlighting practical impact for edge deployments and suggesting directions for future multi-source SSL and SELDT integration.

Abstract

Robust sound source localization in noisy and reverberant environments increasingly exploits deep neural networks fed with various acoustic features. Yet state-of-the-art research mainly focuses on optimizing algorithmic accuracy, resulting in huge models that prevent edge-device deployment. The edge, however, calls for real-time, low-footprint acoustic reasoning for applications such as hearing aids and robot interactions. Hence, we start from a robust CNN-based model using SRP-PHAT features, Cross3D [16], to pursue an efficient yet compact model architecture for the extreme edge. For the SRP feature representation and the neural network, we propose our scalable LC-SRP-Edge and Cross3D-Edge algorithms, respectively, both optimized towards lower hardware overhead. LC-SRP-Edge halves the complexity and on-chip memory overhead of the sinc interpolation compared to the original LC-SRP [19]. Over multiple SRP resolution cases, Cross3D-Edge saves 10.32~73.71% computational complexity and 59.77~94.66% neural network weights against the Cross3D baseline. In terms of the accuracy-efficiency tradeoff, the most balanced version (EM) requires only 127.1 MFLOPS computation, 3.71 MByte/s bandwidth, and 0.821 MByte on-chip memory in total, while remaining competitive in state-of-the-art accuracy comparisons. It achieves 8.59 ms/frame end-to-end latency on a Raspberry Pi 4B, which is 7.26x faster than the corresponding baseline.
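The latency figures quoted above can be cross-checked with a little arithmetic. The sketch below is only an illustration: it assumes frames are processed one after another, and it reads "7.26x faster" as a ratio of per-frame latencies. The function names are hypothetical, and the baseline latency it prints is an inference from those two numbers, not a value reported in the abstract.

```python
# Figures quoted in the abstract.
LATENCY_MS_PER_FRAME = 8.59    # end-to-end latency on a Raspberry Pi 4B
SPEEDUP_OVER_BASELINE = 7.26   # "7.26x faster than the corresponding baseline"


def frames_per_second(latency_ms: float) -> float:
    """Throughput if frames are processed sequentially (assumption)."""
    return 1000.0 / latency_ms


def implied_baseline_latency_ms(latency_ms: float, speedup: float) -> float:
    """Baseline latency, assuming the speedup is a per-frame latency ratio."""
    return latency_ms * speedup


fps = frames_per_second(LATENCY_MS_PER_FRAME)
baseline_ms = implied_baseline_latency_ms(LATENCY_MS_PER_FRAME, SPEEDUP_OVER_BASELINE)

print(f"throughput: {fps:.1f} frames/s")
print(f"implied baseline latency: {baseline_ms:.1f} ms/frame")
```

Under these assumptions the throughput rounds to the 116 frames per second mentioned in the TL;DR, and the baseline would take roughly 62 ms per frame.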

Paper Structure

This paper contains 27 sections, 9 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: An overview diagram of the modern Sound Source Localization (SSL) practice with Deep Neural Networks (DNN).
  • Figure 2: Cross3D [16] model structure and workflow. $T$ denotes the length of the SRP sequence. The branch depth $N$ is determined by the SRP resolution: $N = \min(4, \log_2(\min(Res1, Res2)))$.
  • Figure 3: Diagrams of the original Cross3D baseline model (a) and the proposed Cross3D-Edge model (b). Res1 and Res2 denote the resolution of the SRP candidate space along the elevation and azimuth dimensions, respectively. The modifications to the algorithm are marked in red text.
  • Figure 4: The distributions of computational complexity and parameter count across the layers of the original Cross3D [16], showing that Cross_Conv is the most computationally intensive layer while Output_Conv1 is the most memory-expensive. Note that the layer names follow the Cross3D diagram, where Output_Conv1 and Output_Conv2 stand for the last two 1D CNN layers, respectively.
  • Figure 5: The localization RMSAE scores (the smaller, the better) of the pre-trained (Diaz-Guerra, 2020) and our re-trained Cross3D (Baseline) models. TD-SRP is used as the input feature for both models.
  • ...and 5 more figures