Permutation Invariant Recurrent Neural Networks for Sound Source Tracking Applications
David Diaz-Guerra, Archontis Politis, Antonio Miguel, Jose R. Beltran, Tuomas Virtanen
TL;DR
This work addresses the permutation sensitivity of recurrent tracking layers in multi-source sound localization by introducing a permutation-invariant recurrent neural network (PI-RNN) that operates on unordered sets of input and state embeddings. The PI-RNN uses multi-head attention to match input embeddings to state embeddings and then updates each state with a GRU-like rule, $\mathbf{h}_i(t) = [1-\mathbf{z}_i(t)] \odot \mathbf{h}_i(t-1) + \mathbf{z}_i(t) \odot \tilde{\mathbf{h}}_i(t)$ with $\tilde{\mathbf{h}}_i(t)=\tanh(\mathbf{c}_i(t)\mathbf{W}^h)$, which makes the layer invariant to permutations of the input set and equivariant to permutations of the state set. The PI-RNN is evaluated as an add-on to an icoCNN-based localization system, using ACCDOA embeddings of size $d$ and training with sPIT on synthetic, reverberant scenes with up to 3 active sources; it shows lower localization error and fewer identity switches than the baselines. The results indicate that exploiting permutation symmetries improves tracking performance and scalability, with potential further gains if more spectral information is incorporated. Overall, the work provides a set-based recurrent layer that can be integrated into diverse tracking architectures to improve multi-source sound tracking.
Abstract
Many multi-source localization and tracking models based on neural networks use one or several recurrent layers at their final stages to track the movement of the sources. Conventional recurrent neural networks (RNNs), such as long short-term memory (LSTM) networks or gated recurrent units (GRUs), take a vector as their input and use another vector to store their state. However, this approach packs the information from all the sources into a single ordered vector, which is not optimal for permutation-invariant problems such as multi-source tracking. In this paper, we present a new recurrent architecture that uses unordered sets to represent both its input and its state, and that is invariant to permutations of the input set and equivariant to permutations of the state set. Hence, the information of every sound source is represented in an individual embedding, and the new estimates are assigned to the tracked trajectories regardless of their order.
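
The two symmetry properties stated above can be checked numerically on a toy version of such a layer. The following is a minimal NumPy sketch and not the authors' implementation. It assumes single-head scaled dot-product attention instead of the multi-head attention used in the paper, it assumes that $\mathbf{c}_i(t)$ is the attention output for state $i$, and it assumes a particular form for the gate $\mathbf{z}_i(t)$. The function name pi_rnn_step and the parameter names Wq, Wk, Wv, Wz, Uz and Wh are hypothetical, and the weights are random rather than trained.

```python
import numpy as np

def softmax(a, axis):
    # Numerically stable softmax along the given axis.
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def pi_rnn_step(x, h_prev, p):
    """One toy set-based recurrent step.

    x:      (n_inputs, d) unordered set of input embeddings
    h_prev: (n_states, d) unordered set of state embeddings
    p:      dict of (d, d) weight matrices (hypothetical names)
    """
    q = h_prev @ p["Wq"]                      # queries come from the states
    k = x @ p["Wk"]                           # keys come from the inputs
    v = x @ p["Wv"]                           # values come from the inputs
    scores = q @ k.T / np.sqrt(q.shape[1])    # (n_states, n_inputs)
    attn = softmax(scores, axis=1)            # each state attends over all inputs
    c = attn @ v                              # assumed c_i(t): attended input per state
    # Assumed gate form (not given in the text above).
    z = sigmoid(c @ p["Wz"] + h_prev @ p["Uz"])
    h_tilde = np.tanh(c @ p["Wh"])            # candidate state, tanh(c W^h)
    return (1.0 - z) * h_prev + z * h_tilde   # GRU-like interpolation

rng = np.random.default_rng(0)
d, n_inputs, n_states = 4, 3, 3
names = ("Wq", "Wk", "Wv", "Wz", "Uz", "Wh")
params = {name: rng.standard_normal((d, d)) for name in names}
x = rng.standard_normal((n_inputs, d))
h_prev = rng.standard_normal((n_states, d))

h_new = pi_rnn_step(x, h_prev, params)
perm = np.array([2, 0, 1])

# Invariance: reordering the inputs leaves every new state unchanged.
invariant = np.allclose(pi_rnn_step(x[perm], h_prev, params), h_new)
# Equivariance: reordering the states reorders the new states the same way.
equivariant = np.allclose(pi_rnn_step(x, h_prev[perm], params), h_new[perm])
print(invariant, equivariant)
```

Invariance holds in this sketch because each state sums over all inputs through the attention weights, so the input order only changes the order of the summands. Equivariance holds because every other operation acts on each state row independently with shared weights.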
