
Low-power SNN-based audio source localisation using a Hilbert Transform spike encoding scheme

Saeid Haghighatshoar, Dylan R Muir

TL;DR

This work addresses low-power direction-of-arrival (DoA) estimation for wideband audio on microphone arrays by replacing dense narrowband filtering with a Hilbert-transform-based beamforming framework that exploits the phase of the analytic signal. It introduces an online Short-Time Hilbert Transform (STHT) and Robust Zero-Crossing Conjugate (RZCC) spike encoding to enable real-time, energy-efficient DoA estimation in spiking neural networks (SNNs), and proves an equivalence between real-valued and complex-valued beamformers that simplifies hardware implementation. The approach achieves high DoA accuracy on noisy wideband signals and speech, reaches state-of-the-art accuracy among SNN methods, and is deployed on ultra-low-power hardware (Xylo) within a milliwatt power envelope. Comparisons with MUSIC show competitive accuracy at lower computational cost, because per-band filterbanks are avoided, which makes ultra-low-power audio localisation practical on IoT devices. The results suggest a co-design path in which Hilbert-transform-based DSP and SNN hardware are combined to deliver accurate, energy-efficient audio localisation for diverse microphone geometries.

Abstract

Sound source localisation is used in many consumer devices to isolate audio from individual speakers and reject noise. Localisation is frequently accomplished by "beamforming", which combines phase-shifted audio streams to increase power from chosen source directions, under a known microphone array geometry. Dense band-pass filters are often needed to obtain narrowband signal components from wideband audio. These approaches achieve high accuracy, but narrowband beamforming is computationally demanding and not ideal for low-power IoT devices. We demonstrate a novel method for sound source localisation on arbitrary microphone arrays, designed for efficient implementation in ultra-low-power spiking neural networks (SNNs). We use a Hilbert transform to avoid dense band-pass filters, and introduce a new event-based encoding method that captures the phase of the complex analytic signal. Our approach achieves state-of-the-art accuracy for SNN methods, comparable with traditional non-SNN super-resolution beamforming. We deploy our method to low-power SNN inference hardware, with much lower power consumption than super-resolution methods. We demonstrate that signal processing approaches co-designed with spiking neural network implementations can achieve much improved power efficiency. Our new Hilbert-transform-based method for beamforming can also improve the efficiency of traditional DSP-based signal processing.
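
As a rough illustration of the idea in the abstract (analytic signals from a Hilbert transform, phase-shifted and summed to find the direction of peak power), here is a minimal NumPy sketch. Everything specific in it is an assumption made for the example rather than a detail taken from the paper: the 7-microphone circular array of radius 4.5 cm, the 48 kHz sample rate, the 1.5–2.5 kHz noise band, the speed of sound of 343 m/s, and the helper names `analytic_signal`, `steering_matrix` and `estimate_doa`. It uses an offline FFT-based Hilbert transform and simple phase-steering weights at the centre frequency, not the online STHT, the spike encoding, or the analytically derived weights described in the paper.

```python
import numpy as np

def analytic_signal(x):
    """FFT-based analytic signal along the last axis (offline Hilbert transform)."""
    n = x.shape[-1]
    gain = np.zeros(n)
    gain[0] = 1.0                       # keep DC
    gain[1:(n + 1) // 2] = 2.0          # double the positive frequencies
    if n % 2 == 0:
        gain[n // 2] = 1.0              # keep the Nyquist bin for even lengths
    return np.fft.ifft(np.fft.fft(x, axis=-1) * gain, axis=-1)

def steering_matrix(mic_xy, f_centre, grid_deg, c=343.0):
    """Far-field phase of each microphone for each candidate angle (assumed model)."""
    theta = np.deg2rad(grid_deg)
    unit = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # (angles, 2)
    advance = unit @ mic_xy.T / c                             # (angles, mics), seconds
    return np.exp(2j * np.pi * f_centre * advance)

def estimate_doa(x, mic_xy, f_centre, grid_deg):
    """Return the grid angle with maximum beam power, and the power per angle."""
    xa = analytic_signal(x)                                   # (mics, samples)
    beams = steering_matrix(mic_xy, f_centre, grid_deg).conj() @ xa
    power = np.mean(np.abs(beams) ** 2, axis=1)
    return grid_deg[np.argmax(power)], power

# Synthetic far-field source: band-limited noise arriving from 60 degrees.
rng = np.random.default_rng(0)
fs, n, n_mics, radius = 48_000, 4_800, 7, 0.045
psi = 2 * np.pi * np.arange(n_mics) / n_mics
mic_xy = radius * np.stack([np.cos(psi), np.sin(psi)], axis=1)

freqs = np.fft.rfftfreq(n, 1 / fs)
band = (freqs >= 1_500) & (freqs <= 2_500)
source_spec = np.fft.rfft(rng.standard_normal(n)) * band

true_deg = 60.0
u = np.array([np.cos(np.deg2rad(true_deg)), np.sin(np.deg2rad(true_deg))])
advance = mic_xy @ u / 343.0                                  # per-mic time advance
# Apply each microphone's time shift as a linear phase in the frequency domain.
x = np.fft.irfft(source_spec * np.exp(2j * np.pi * freqs * advance[:, None]),
                 n=n, axis=-1)

grid_deg = np.arange(0.0, 360.0, 1.0)
est_deg, power = estimate_doa(x, mic_xy, 2_000.0, grid_deg)
print(est_deg)
```

Note that a single small weight matrix (angles × microphones) is applied to the wideband analytic signals, with no per-band filterbank. In this idealised, noise-free setting the estimate should land on or next to the simulated direction.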

Paper Structure (21 sections, 4 theorems, 70 equations, 12 figures, 2 tables)

This paper contains 21 sections, 4 theorems, 70 equations, 12 figures, 2 tables.

Key Result

Theorem 1

Let $x_a(t)$ be the analytic signal corresponding to a real-valued signal $x(t)$, and let $e(t)$ and $\phi(t)$ be its envelope and phase, respectively. Let $T>0$ be such that the time interval $t \in [0, T]$ contains the major part of the energy of the signal. Define the average of $e(t)$ over this interval [...] is the spectral average of the frequency of the signal.
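
The statement above is truncated in this listing, so the sketch below does not try to reproduce the theorem. It only illustrates the quantities the statement names (the envelope $e(t)$ and phase $\phi(t)$ of the analytic signal, and the spectral average of frequency), using a standard identity from time-frequency analysis: the $e(t)^2$-weighted time average of the instantaneous frequency $\frac{1}{2\pi}\frac{d\phi}{dt}$ matches the power-weighted average frequency of the analytic signal's spectrum. Whether this is the exact form used in Theorem 1 is an assumption. The two-tone test signal, the sample rate and all variable names are also chosen only for the example.

```python
import numpy as np

fs, n = 48_000, 4_800                       # 0.1 s of signal (assumed values)
t = np.arange(n) / fs
# Two tones with an integer number of cycles in the window (no spectral leakage).
x = np.cos(2 * np.pi * 1_900 * t) + 0.5 * np.cos(2 * np.pi * 2_100 * t)

# Analytic signal via the FFT: zero negative frequencies, double positive ones.
gain = np.zeros(n)
gain[0] = 1.0
gain[1:n // 2] = 2.0
gain[n // 2] = 1.0                          # n is even here
xa = np.fft.ifft(np.fft.fft(x) * gain)

envelope = np.abs(xa)                       # e(t)
phase = np.unwrap(np.angle(xa))             # phi(t)
inst_freq = np.gradient(phase, 1 / fs) / (2 * np.pi)   # Hz

# Envelope-squared-weighted time average of the instantaneous frequency.
time_avg = np.sum(envelope ** 2 * inst_freq) / np.sum(envelope ** 2)

# Power-weighted average frequency of the analytic signal's spectrum.
freqs = np.fft.fftfreq(n, 1 / fs)
psd = np.abs(np.fft.fft(xa)) ** 2
spec_avg = np.sum(freqs * psd) / np.sum(psd)

print(time_avg, spec_avg)
```

For this input the tone powers are in the ratio 1 : 0.25, so the spectral average is (1900 + 0.25 × 2100) / 1.25 = 1940 Hz, and the time-domain estimate should agree to within the small error of the finite-difference phase derivative.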

Figures (12)

  • Figure 1: Narrowband and Hilbert beamforming. a Geometry of a circular microphone array. b Narrowband beamforming approach. A dense filterbank or Fourier transform (FFT/DFT) provides narrowband signals, which are delayed and then combined through a large beamforming weight tensor to estimate the direction of arrival (DoA) of incident audio as the peak power direction. An SNN may be used for performing beamforming and estimating DoA [pan2021multi]. c DoA estimation geometry for far-field audio. d Our novel wideband Hilbert beamforming approach. Wideband analytic signals $X_A$ are obtained by a Hilbert transform. Wideband analytic signals are combined through a small beamforming weight matrix to estimate DoA. e The phase progression $\phi$ (coloured lines) of wideband analytic signals $X_A$ generated from wideband noise with centre frequency $F_C=2$ kHz is very similar to that of a narrowband signal with frequency $F=2$ kHz. f–g Beam patterns from applying Hilbert beamforming to narrowband signals with $F=2$ kHz (f), and wideband signals with centre frequency $F_C=2$ kHz (g).
  • Figure 2: Online Short-Time Hilbert Transform (STHT) and Robust Zero-Crossing Conjugate (RZCC) event encoding. a The STHT kernel (top), which estimates the quadrature component $x_Q$ of a signal, and the frequency response of the kernel (bottom). b A noisy narrowband input signal ($x$; blue), with the quadrature component obtained from an infinite-time Hilbert transform ($x_Q$; dashed) and the STHT-derived version ($\hat{x}_Q$; orange). Note the onset transient response of the filter before and around $t<2$ ms. Corresponding up- and down-zero-crossing encoding events estimated from $x$ and $\hat{x}_Q$ are shown at top. c For a given signal (solid), zero-crossing events (red) are estimated robustly by finding the peaks and troughs of the cumulative sum (dashed), within a window (ZC window).
  • Figure 3: Audio localization with STHT, RZCC encoding and DoA inference with a Spiking Neural Network (SNN). a The pipeline for SNN implementation of our Hilbert beamforming and DoA estimation, combining Short-Time Hilbert transform; Zero-crossing conjugate encoding; analytically-derived beamforming weights; and Leaky-Integrate-and-Fire (LIF) spiking neurons for power accumulation and DoA estimation. b–c Beam patterns for SNN STHT RZCC beamforming, for narrowband (b; $F=2$ kHz) and wideband (c; $F_C=2$ kHz) signals. d–e Beam power and DoA estimates for noisy narrowband signals (d) and for noisy encoded speech (e). Dashed lines: estimated DoA. f–g DoA estimation error for noisy narrowband signals (f) and noisy speech (g). Dashed line: $1.0^{\circ}$. Annotations: Mean Absolute Error (MAE). Box plot: centre line: median; box limits: quartiles; whiskers: 1.5$\times$ inter-quartile range; points: outliers. $n=100$ random trials.
  • Figure S1: The Hilbert transform applied to wideband speech signals. a A raw speech sample. b The real (blue) and imaginary (orange) components of the Hilbert analytic signal version of the speech sample. Note their similarity to the raw sample, with the phase shift of the imaginary component. c The amplitude (green) and unwrapped phase (purple) of the analytic signal. Note the smooth phase progression of the analytic signal. The instantaneous dominant frequency can be obtained from the slope of the phase signal.
  • Figure S2: Illustration of the almost-linear behavior of phase for superposition of two complex exponentials.
  • ...and 7 more figures
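
The caption of Figure 2c describes estimating zero crossings from the peaks and troughs of the cumulative sum within a window. One possible reading of that description is sketched below in NumPy. The function name `robust_zero_crossings`, the 13-sample window, the 2 kHz test tone and the noise level are all assumptions made for illustration, and the paper's own implementation may differ. The reasoning is that the cumulative sum stops falling and starts rising where the signal goes from negative to positive, so a trough marks an up-crossing and a peak marks a down-crossing. Summation smooths sample-level noise, and the window suppresses duplicate events around a single crossing.

```python
import numpy as np

def robust_zero_crossings(x, window):
    """Return (up, down) sample indices of zero crossings of x.

    An index is reported only if the cumulative sum attains its minimum
    (up-crossing) or maximum (down-crossing) over a centred window there.
    """
    c = np.cumsum(x)
    half = window // 2
    up, down = [], []
    for i in range(half, len(c) - half):
        segment = c[i - half:i + half + 1]
        if np.argmin(segment) == half:      # trough of the cumulative sum
            up.append(i)
        elif np.argmax(segment) == half:    # peak of the cumulative sum
            down.append(i)
    return np.array(up), np.array(down)

# Noisy 2 kHz tone sampled at 48 kHz: 24 samples per period.
rng = np.random.default_rng(0)
fs, f0 = 48_000, 2_000
t = np.arange(480) / fs
x = np.sin(2 * np.pi * f0 * t) + 0.2 * rng.standard_normal(t.size)

up, down = robust_zero_crossings(x, window=13)
print(np.diff(up))     # spacing between successive up-crossings, in samples
print(np.diff(down))   # spacing between successive down-crossings, in samples
```

A plain sign-change test on `x` would tend to emit several events near each crossing of this noisy signal, whereas the windowed extremum of the cumulative sum yields at most one event per window. In the RZCC scheme the same kind of detection would presumably be applied to both the signal and its STHT-estimated quadrature component, which this sketch does not attempt.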

Theorems & Definitions (25)

  • Remark 1
  • Remark 2
  • Remark 3
  • Remark 4
  • Example 1
  • Theorem 1
  • Remark 5
  • proof
  • Example 2
  • Example 3
  • ...and 15 more