Unsupervised Variational Acoustic Clustering
Luan Vinícius Fiorio, Bruno Defraene, Johan David, Frans Widdershoven, Wim van Houtum, Ronald M. Aarts
TL;DR
UVAC addresses unsupervised acoustic clustering by extending variational autoencoders with a Gaussian mixture model (GMM) prior, yielding cluster-friendly latent representations of time-frequency audio data. The model employs a convolutional–recurrent encoder–decoder, processes time-context spectrogram windows, and uses a latent space of dimension $d_z=10$ with $C=10$ GMM components. On AudioMNIST, UVAC delivers substantial gains in unsupervised accuracy ($\approx 71\%$) and normalized mutual information (NMI, $\approx 0.71$) compared with K-means and EM-based GMM baselines, while achieving favorable Silhouette and Davies-Bouldin index (DBI) scores, which indicates improved clustering of complex audio patterns. These results suggest that combining variational clustering with audio-specific architectures is a promising route to efficient, unsupervised analysis in resource-constrained settings.
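Unsupervised accuracy is not directly comparable to supervised accuracy, because cluster indices carry no inherent meaning. A common convention, assumed here rather than taken from the text above, is to score the best one-to-one mapping between cluster indices and ground-truth labels. The sketch below illustrates that convention in plain Python; the function name `unsupervised_accuracy` and the toy data are hypothetical.

```python
from itertools import permutations


def unsupervised_accuracy(labels, clusters, num_classes):
    """Accuracy under the best one-to-one cluster-to-label mapping.

    Assumption: labels and cluster indices are integers in [0, num_classes).
    """
    if len(labels) != len(clusters):
        raise ValueError("labels and clusters must have the same length")
    # counts[k][j] = number of samples in cluster k whose true label is j
    counts = [[0] * num_classes for _ in range(num_classes)]
    for y, k in zip(labels, clusters):
        counts[k][y] += 1
    # Brute-force search over all mappings; only practical for small num_classes.
    best = max(
        sum(counts[k][perm[k]] for k in range(num_classes))
        for perm in permutations(range(num_classes))
    )
    return best / len(labels)


# Toy example with 3 classes (hypothetical data, not from the paper).
labels = [0, 0, 1, 1, 2, 2]
clusters = [2, 2, 0, 0, 1, 0]
acc = unsupervised_accuracy(labels, clusters, num_classes=3)
```

The exhaustive search keeps the example dependency-free, but it scales factorially. With $C=10$ clusters, an assignment solver such as SciPy's `scipy.optimize.linear_sum_assignment` applied to the negated count matrix would be the more practical choice.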
Abstract
We propose an unsupervised variational acoustic clustering model for clustering audio data in the time-frequency domain. The model leverages variational inference, extended to an autoencoder framework, with a Gaussian mixture model as a prior for the latent space. We introduce a convolutional-recurrent variational autoencoder, designed specifically for audio applications and optimized for efficient time-frequency processing. Experiments on a spoken digits dataset demonstrate a significant improvement in accuracy and clustering performance compared to traditional methods, showcasing the model's enhanced ability to capture complex audio patterns.
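To make the role of the Gaussian mixture prior concrete, the following minimal sketch shows how a latent vector could be assigned to a cluster by computing the posterior responsibility of each mixture component. It assumes diagonal covariances and known mixture parameters, neither of which is stated in the abstract; the function names and the toy two-dimensional, two-component setup are hypothetical and do not reproduce the authors' implementation.

```python
import math


def log_gaussian_diag(z, mean, var):
    """Log density of a Gaussian with diagonal covariance (assumed form)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (x - m) ** 2 / v
        for x, m, v in zip(z, mean, var)
    )


def cluster_responsibilities(z, weights, means, variances):
    """Posterior probability of each mixture component given latent z."""
    log_joint = [
        math.log(w) + log_gaussian_diag(z, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_joint)
    unnorm = [math.exp(value - top) for value in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Toy setup: 2 components in a 2-dimensional latent space (hypothetical values).
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

resp = cluster_responsibilities([0.2, -0.1], weights, means, variances)
# Hard cluster assignment: index of the most responsible component.
assigned = max(range(len(resp)), key=resp.__getitem__)
```

In a trained model the latent vector would come from the encoder and the mixture parameters would be learned jointly with the network; here they are fixed by hand purely for illustration.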
