AuditoryHuM: Auditory Scene Label Generation and Clustering using Human-MLLM Collaboration

Henry Zhong; Jörg M. Buchholz; Julian Maclaren; Simon Carlile; Richard F. Lyon

TL;DR

AuditoryHuM tackles the labour-intensive problem of annotating auditory scenes by using Multimodal Large Language Models (MLLMs) to generate contextually relevant labels, which are grounded with zero-shot CLAP alignment and refined through targeted human-in-the-loop (HITL) review. It then clusters the text-label embeddings with an adjusted silhouette score that includes a penalty $\lambda$ derived from AIC to balance cohesion and granularity, producing thematically meaningful composite labels. The approach is validated across ADVANCE, AHEAD-DS, and TAU 2019, demonstrating strong alignment with human perception, robust cluster structure, and practical composite labels suitable for training edge-deployable scene recognisers. The workflow yields a scalable, low-cost taxonomy suitable for deployment on devices such as hearing aids and smart home assistants, while highlighting the importance of human-in-the-loop checks and careful metric design to mitigate MLLM hallucinations.

Abstract

Manual annotation of audio datasets is labour intensive, and it is challenging to balance label granularity with acoustic separability. We introduce AuditoryHuM, a novel framework for the unsupervised discovery and clustering of auditory scene labels using a collaborative Human-Multimodal Large Language Model (MLLM) approach. By leveraging MLLMs (Gemma and Qwen), the framework generates contextually relevant labels for audio data. To ensure label quality and mitigate hallucinations, we employ zero-shot learning techniques (Human-CLAP) to quantify the alignment between generated text labels and raw audio content. A strategically targeted human-in-the-loop intervention is then used to refine the least aligned pairs. The discovered labels are grouped into thematically cohesive clusters using an adjusted silhouette score that incorporates a penalty parameter to balance cluster cohesion and thematic granularity. Evaluated across three diverse auditory scene datasets (ADVANCE, AHEAD-DS, and TAU 2019), AuditoryHuM provides a scalable, low-cost solution for creating standardised taxonomies. This solution facilitates the training of lightweight scene recognition models deployable to edge devices, such as hearing aids and smart home assistants. Project page and code: https://github.com/Australian-Future-Hearing-Initiative
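The alignment-then-review step described in the abstract can be illustrated with a minimal sketch. It assumes that audio and text embeddings have already been produced by a CLAP-style model and that alignment is measured as cosine similarity between each paired audio and label embedding. The function names `cosine_alignment` and `least_aligned`, and the fixed review budget `n_review`, are hypothetical and are not taken from the paper.

```python
import numpy as np


def cosine_alignment(audio_emb, text_emb):
    """Row-wise cosine similarity between paired audio and text embeddings.

    Assumption: row i of audio_emb and row i of text_emb describe the same clip.
    """
    audio_emb = np.asarray(audio_emb, dtype=float)
    text_emb = np.asarray(text_emb, dtype=float)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return np.sum(a * t, axis=1)


def least_aligned(audio_emb, text_emb, n_review):
    """Return indices of the n_review lowest-scoring pairs, plus all scores.

    Assumption: a fixed review budget decides how many pairs go to a human.
    """
    scores = cosine_alignment(audio_emb, text_emb)
    order = np.argsort(scores)  # ascending, so the worst-aligned pairs come first
    return order[:n_review], scores


if __name__ == "__main__":
    # Toy embeddings standing in for real model outputs.
    audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    text = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
    review_idx, scores = least_aligned(audio, text, n_review=1)
    print("send to human review:", review_idx.tolist())
```

Sorting by score and reviewing only the tail keeps human effort proportional to the review budget rather than to the dataset size, which matches the targeted intervention the abstract describes.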

Paper Structure (17 sections, 3 equations, 4 figures, 9 tables)

Introduction
Related Work
Method
MLLM Choices
Alignment and Clustering Choices
Dataset Choices
Results
Label Alignment
Clustering
Composite Labels
Sensitivity Analysis
CLAP Implementation
Label Cleanup
Sentence-Transformer Implementation
Clustering Algorithm
...and 2 more sections

Figures (4)

Figure 1: Schematic of the AuditoryHuM processing, showing the steps from MLLM label discovery, human-in-the-loop refinement and cleanup, text encoding, and clustering.
Figure 2: The adjusted silhouette scores $s_{adj}$. Silhouette scores range from 1 to $-1$, where 1 represents perfect cluster cohesion, 0 represents a random cluster assignment, and $-1$ represents an incorrect cluster assignment. The value $s_{adj}$ ranges from $1 - \lambda \times k$ to $-1 - \lambda \times k$. A higher value is better, so the optimal $k$ value occurs at the peak of each curve.
Figure 3: t-SNE visualisations using the optimal $k$ value for each dataset. Each unique colour represents a cluster of thematically related labels.
Figure 4: The adjusted silhouette scores $s_{adj}$ comparing the datasets using all-mpnet-base-v2 and all-MiniLM-L6-v2. Silhouette scores range from 1 to $-1$, where 1 represents perfect cluster cohesion and $-1$ represents an incorrect cluster assignment. The value $s_{adj}$ ranges from $1 - \lambda \times k$ to $-1 - \lambda \times k$. A higher value is better.
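The range given in the captions of Figures 2 and 4 suggests that the adjusted score has the form $s_{adj} = s - \lambda \times k$, where $s$ is the ordinary silhouette score and $k$ is the number of clusters. The sketch below assumes that form, treats $\lambda$ as a given constant (its derivation from AIC is not reproduced here), and takes candidate cluster assignments as input rather than assuming a particular clustering algorithm. The function names are hypothetical.

```python
import numpy as np


def silhouette_score(X, labels):
    """Mean silhouette over all points, using Euclidean distance.

    Singleton clusters score 0 by convention. Needs at least two clusters.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = int(same.sum())
        if n_same == 1:
            continue  # singleton: leave the score at 0
        # Mean distance to the other members of the point's own cluster.
        a = dist[i, same].sum() / (n_same - 1)
        # Mean distance to the nearest other cluster.
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return float(scores.mean())


def adjusted_silhouette(s, k, lam):
    """Assumed form inferred from the figure captions: s_adj = s - lam * k."""
    return s - lam * k


def select_k(X, candidate_labelings, lam):
    """Pick the k whose labeling maximises the adjusted silhouette score.

    candidate_labelings maps k to a cluster assignment produced elsewhere.
    """
    scores = {
        k: adjusted_silhouette(silhouette_score(X, labels), k, lam)
        for k, labels in candidate_labelings.items()
    }
    best_k = max(scores, key=scores.get)
    return best_k, scores


if __name__ == "__main__":
    # Three well-separated toy groups standing in for label embeddings.
    X = [[0, 0], [0, 1], [10, 0], [10, 1], [20, 0], [20, 1]]
    candidates = {
        2: [0, 0, 0, 0, 1, 1],
        3: [0, 0, 1, 1, 2, 2],
        4: [0, 0, 1, 1, 2, 3],
    }
    best_k, scores = select_k(X, candidates, lam=0.01)
    print("best k:", best_k)
```

Because the penalty grows linearly with $k$, a larger $\lambda$ favours fewer, broader clusters, while a smaller $\lambda$ lets the plain silhouette score dominate the choice.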
