Categorical Unsupervised Variational Acoustic Clustering
Luan Vinícius Fiorio, Ivana Nikoloska, Ronald M. Aarts
TL;DR
The paper addresses unsupervised clustering of audio in time-frequency representations where data points strongly overlap, as in urban acoustic scenes. It extends unsupervised variational acoustic clustering (UVAC) with a categorical latent variable, implemented via a differentiable Gumbel-Softmax whose temperature $\tau$ controls cluster sharpness. A variational inference framework yields an ELBO with reconstruction and KL terms, enabling end-to-end training. Empirical results on AudioMNIST, TAU2019, and UrbanSound8K show that the categorical model (Cat. UVAC) outperforms Gaussian-based UVAC and K-means, particularly on overlapped data, with clustering metrics improving as the number of clusters is reduced. The method offers a practical unsupervised clustering approach for audio, with potential applications in hearing aids and related processing pipelines, and uses the discrete latent structure to obtain sharp, meaningful clusters.
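As a rough illustration of the objective's shape, and not the paper's exact formulation, a generic ELBO for a model with a single categorical latent $c \in \{1, \dots, K\}$ can be written as below. The symbols $q_\phi$, $p_\theta$, $\pi_k$, and the uniform prior are assumptions introduced here for illustration; the full model may include further latent variables and terms.

```latex
% Generic ELBO with one categorical latent c (illustrative assumption)
\mathcal{L}(x) =
  \underbrace{\mathbb{E}_{q_\phi(c \mid x)}\!\left[\log p_\theta(x \mid c)\right]}_{\text{reconstruction}}
  - \underbrace{\mathrm{KL}\!\left(q_\phi(c \mid x) \,\|\, p(c)\right)}_{\text{regularization}}

% With a uniform prior p(c) = 1/K and \pi_k = q_\phi(c = k \mid x):
\mathrm{KL}\!\left(q_\phi(c \mid x) \,\|\, p(c)\right)
  = \sum_{k=1}^{K} \pi_k \log \pi_k + \log K
```

Under this assumed uniform prior, the KL term equals $\log K$ minus the entropy of the cluster posterior, so it is zero when the posterior is uniform and largest when it is one-hot.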
Abstract
We propose a categorical approach for unsupervised variational acoustic clustering of audio data in the time-frequency domain. Using a categorical distribution enforces sharper clustering even when data points strongly overlap in time and frequency, which is the case for most datasets of urban acoustic scenes. To this end, we use a Gumbel-Softmax distribution as a soft approximation to the categorical distribution, allowing for training via backpropagation. In this setting, the softmax temperature serves as the main mechanism to tune clustering performance. The results show that the proposed model can obtain impressive clustering performance for all considered datasets, even when data points strongly overlap in time and frequency.
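The role of the temperature can be illustrated with a minimal NumPy sketch of Gumbel-Softmax sampling. This is not the authors' implementation: the function name `gumbel_softmax_sample`, the three-cluster probabilities, and the two temperature values are hypothetical choices made only for this example.

```python
import numpy as np

def gumbel_softmax_sample(logits, tau, rng):
    """Draw one relaxed (soft) one-hot sample from a categorical distribution."""
    # Gumbel(0, 1) noise via inverse transform sampling: g = -log(-log(u)).
    # The lower bound avoids log(0); the upper bound 1.0 is exclusive.
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))
    # Perturb the logits and scale by the temperature tau.
    y = (logits + g) / tau
    # Numerically stable softmax over the last axis.
    y = y - y.max(axis=-1, keepdims=True)
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical cluster probabilities for a single input (3 clusters).
logits = np.log(np.array([0.5, 0.3, 0.2]))

# Same seed for both draws, so only the temperature differs.
soft = gumbel_softmax_sample(logits, tau=5.0, rng=np.random.default_rng(0))
sharp = gumbel_softmax_sample(logits, tau=0.1, rng=np.random.default_rng(0))

print("tau = 5.0:", np.round(soft, 3))
print("tau = 0.1:", np.round(sharp, 3))
```

With identical noise, lowering $\tau$ moves the sample toward a one-hot vector (a hard cluster assignment), while a high $\tau$ moves it toward a uniform vector. In a trainable model the same computation would be expressed in an automatic-differentiation framework so that gradients flow through the relaxed sample.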
