Permutation Invariant Recurrent Neural Networks for Sound Source Tracking Applications

David Diaz-Guerra, Archontis Politis, Antonio Miguel, Jose R. Beltran, Tuomas Virtanen

TL;DR

This work addresses the permutation sensitivity of recurrent tracking in multi-source sound localization by introducing a permutation-invariant recurrent neural network (PI-RNN) that operates on unordered sets of input and state embeddings. The PI-RNN uses multi-head attention to match input embeddings to state embeddings and then updates each state embedding with a GRU-like rule, $\mathbf{h}_i(t) = [1-\mathbf{z}_i(t)] \odot \mathbf{h}_i(t-1) + \mathbf{z}_i(t) \odot \tilde{\mathbf{h}}_i(t)$ with $\tilde{\mathbf{h}}_i(t)=\tanh(\mathbf{c}_i(t)\mathbf{W}^h)$, which makes the layer invariant to permutations of the input set and equivariant to permutations of the state set. The PI-RNN is evaluated as an add-on to an icoCNN-based localization system, using ACCDOA embeddings of size $d$ and training with sPIT on synthetic, reverberant scenes with up to 3 active sources. It shows lower localization error and fewer identity switches than the baselines. The results indicate that exploiting permutation symmetries improves tracking performance and scalability, with potential further gains if more spectral information is incorporated. Overall, the work provides a set-based recurrent layer that can be integrated into diverse architectures for tracking multiple sound sources.

Abstract

Many multi-source localization and tracking models based on neural networks use one or several recurrent layers at their final stages to track the movement of the sources. Conventional recurrent neural networks (RNNs), such as long short-term memory (LSTM) networks or gated recurrent units (GRUs), take a vector as their input and use another vector to store their state. However, this approach results in the information from all the sources being contained in a single ordered vector, which is not optimal for permutation-invariant problems such as multi-source tracking. In this paper, we present a new recurrent architecture that uses unordered sets to represent both its input and its state, and that is invariant to permutations of the input set and equivariant to permutations of the state set. Hence, the information of every sound source is represented in an individual embedding, and the new estimates are assigned to the tracked trajectories regardless of their order.
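The two symmetries named in the abstract can be checked numerically on a toy version of the layer. The following is a minimal sketch, not the authors' implementation: it uses a single attention head instead of multi-head attention, and the exact forms of the context $\mathbf{c}_i(t)$ and the gate $\mathbf{z}_i(t)$ are assumptions, since only the state update and $\tilde{\mathbf{h}}_i(t)=\tanh(\mathbf{c}_i(t)\mathbf{W}^h)$ are given in the TL;DR. The function name `pi_rnn_step` and all weight names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pi_rnn_step(X, H, params):
    """One recurrent step on sets (illustrative sketch).

    X: (M, d) unordered input embeddings at time t.
    H: (N, d) unordered state embeddings from time t-1.
    Returns the (N, d) updated state set.
    """
    Wq, Wk, Wv, Wz, Wh = params
    d = H.shape[1]
    # ASSUMPTION: single-head dot-product attention in which every state
    # embedding queries the whole input set.
    scores = (H @ Wq) @ (X @ Wk).T / np.sqrt(d)   # (N, M)
    A = softmax(scores, axis=1)                   # weights over the inputs
    C = A @ (X @ Wv)                              # (N, d) contexts c_i(t)
    # ASSUMPTION: the gate z_i(t) is computed from the context and the
    # previous state of the same trajectory.
    Z = sigmoid(np.concatenate([C, H], axis=1) @ Wz)
    H_tilde = np.tanh(C @ Wh)                     # tanh(c_i(t) W^h)
    # GRU-like update from the TL;DR, applied row by row.
    return (1.0 - Z) * H + Z * H_tilde

rng = np.random.default_rng(0)
d, M, N = 4, 3, 3
params = (
    rng.normal(size=(d, d)),      # Wq
    rng.normal(size=(d, d)),      # Wk
    rng.normal(size=(d, d)),      # Wv
    rng.normal(size=(2 * d, d)),  # Wz
    rng.normal(size=(d, d)),      # Wh
)
X = rng.normal(size=(M, d))
H = rng.normal(size=(N, d))
out = pi_rnn_step(X, H, params)

p_in = rng.permutation(M)   # reorder the input set
p_st = rng.permutation(N)   # reorder the state set
inv_ok = np.allclose(pi_rnn_step(X[p_in], H, params), out)
equi_ok = np.allclose(pi_rnn_step(X, H[p_st], params), out[p_st])
print("invariant to input order:", inv_ok)
print("equivariant to state order:", equi_ok)
```

In this sketch, reordering the inputs leaves every updated state unchanged because the attention sums over the input set, while reordering the states only reorders the outputs because each row is processed with shared weights. A conventional GRU operating on the concatenation of all source embeddings would not be expected to pass either check.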

Paper Structure

This paper contains 7 sections, 7 equations, 7 figures.

Figures (7)

  • Figure 1: Architecture of the proposed permutation invariant recurrent layer.
  • Figure 2: Architecture of the icoCNN used for evaluation. $B$ is the batch size, $T$ is the number of temporal frames of the acoustic scenes, and $H=2^r=8$ and $W=2^{r+1}=16$ are the height and width of the projections of the icosahedral grid.
  • Figure 3: Architecture of the PI-RNN used after the icoCNN in the evaluated model.
  • Figure 4: Architecture of the conventional RNN used after the icoCNN in the baseline model.
  • Figure 5: Evaluation metrics for the proposed PI-RNN and the baseline models.
  • ...and 2 more figures