Unsupervised Variational Acoustic Clustering

Luan Vinícius Fiorio, Bruno Defraene, Johan David, Frans Widdershoven, Wim van Houtum, Ronald M. Aarts

TL;DR

Unsupervised acoustic clustering is addressed by UVAC, which extends variational autoencoders with a Gaussian mixture prior to yield cluster-friendly latent representations of time-frequency audio data. The model employs a convolutional–recurrent encoder–decoder, processes time-context spectrogram windows, and uses a latent space of dimension $d_z=10$ with $C=10$ GMM components. On AudioMNIST, UVAC delivers substantial gains in unsupervised accuracy ($\approx 71\%$) and NMI ($\approx 0.71$) compared with K-means and EM-based GMM baselines, while achieving favorable Silhouette and DBI scores, illustrating improved clustering of complex audio patterns. This approach demonstrates the potential of integrating variational clustering with audio-specific architectures for efficient, unsupervised analysis in resource-constrained settings.

Abstract

We propose an unsupervised variational acoustic clustering model for clustering audio data in the time-frequency domain. The model leverages variational inference, extended to an autoencoder framework, with a Gaussian mixture model as a prior for the latent space. We introduce a convolutional-recurrent variational autoencoder designed specifically for audio applications and optimized for efficient time-frequency processing. Experiments on a spoken digits dataset show a significant improvement in accuracy and clustering performance compared to traditional methods, showcasing the model's enhanced ability to capture complex audio patterns.
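
To make the role of the Gaussian mixture prior concrete, the sketch below assigns a latent vector to a cluster by computing the standard GMM posterior responsibilities. It is only an illustration: this page does not reproduce the paper's equations, so the diagonal covariances, the uniform mixture weights, the toy component means, and the function names `log_gaussian_diag` and `responsibilities` are all assumptions. Only the sizes $d_z=10$ and $C=10$ come from the summary above.

```python
import math

def log_gaussian_diag(z, mean, var):
    """Log-density of a diagonal-covariance Gaussian evaluated at z."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (x - m) ** 2 / v
        for x, m, v in zip(z, mean, var)
    )

def responsibilities(z, weights, means, variances):
    """Posterior p(c | z) for each mixture component (textbook GMM formula)."""
    log_joint = [
        math.log(w) + log_gaussian_diag(z, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(log_joint)  # subtract the max for numerical stability
    unnorm = [math.exp(value - top) for value in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Sizes from the summary: latent dimension d_z = 10, C = 10 components.
d_z, C = 10, 10
# Hypothetical prior parameters, chosen only so the example is easy to follow.
weights = [1.0 / C] * C
means = [[float(c)] * d_z for c in range(C)]
variances = [[1.0] * d_z for _ in range(C)]

z = [3.1] * d_z  # a made-up latent vector, closest to component 3
gamma = responsibilities(z, weights, means, variances)
cluster = max(range(C), key=gamma.__getitem__)
print(cluster)
```

In a trained model the mixture parameters would be learned jointly with the encoder and decoder rather than fixed by hand, and the paper's actual assignment rule may differ from this textbook form.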

Paper Structure

This paper contains 15 sections, 14 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Inference and generative models for unsupervised clustering.
  • Figure 2: Schematic of the proposed convolutional-recurrent variational autoencoder. The number in each layer indicates its output channels. Each $\mathrm{C}$ encoder layer consists of Conv2D with BatchNorm2D and ReLU. The $\mathrm{G}$ layers are gated recurrent units (GRUs). The layers of the latent space are linear ($\mathrm{L}$) without activation. The decoder layers are Conv2D.T (transposed Conv2D), with ReLU activation in all layers except the last, which uses a sigmoid. All $\mathrm{C}$ kernels are (8,8) with stride (2,2) and padding (3,3).
  • Figure 3: Clusters obtained for the AudioMNIST test set. The data size is reduced for plotting using t-distributed stochastic neighbor embedding. Each color represents a cluster (labels shown in bar plot), and each circle in the plot is a data point.
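
The kernel, stride, and padding values in the Figure 2 caption determine how each convolutional layer changes the spectrogram size. The sketch below applies the standard convolution output-size formula (dilation 1) to those values. The 64 × 64 input, the three-layer depth, and the helper names are hypothetical, and reusing the same settings for the transposed layers is an assumption, since the caption states them only for the $\mathrm{C}$ kernels.

```python
def conv_out(size, kernel=8, stride=2, padding=3):
    """Output length along one axis of a strided convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_transpose_out(size, kernel=8, stride=2, padding=3, output_padding=0):
    """Output length along one axis of a transposed convolution (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

def encoder_shapes(freq_bins, frames, num_layers):
    """Track the (frequency, time) size through a stack of identical layers."""
    shapes = [(freq_bins, frames)]
    for _ in range(num_layers):
        f, t = shapes[-1]
        shapes.append((conv_out(f), conv_out(t)))
    return shapes

# Hypothetical 64 x 64 input window passed through three encoder layers.
print(encoder_shapes(64, 64, 3))  # [(64, 64), (32, 32), (16, 16), (8, 8)]
print(conv_transpose_out(32))     # 64
```

With these settings each layer halves an even-sized axis exactly, and a transposed layer with the same settings doubles it back. An odd-sized axis does not round-trip (65 maps to 32, then back to 64), so input windows with even dimensions are the simpler choice under these assumptions.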