BWSNet: Automatic Perceptual Assessment of Audio Signals
Clément Le Moine Veillon, Victor Rosi, Pablo Arias Sarah, Léane Salais, Nicolas Obin
TL;DR
This work tackles automatic perceptual assessment from Best-Worst Scaling (BWS) judgments by introducing BWSNet, a metric-learning model that maps audio to a latent space in which distances reflect perceptual similarity with respect to a studied attribute. It translates trial-wise ordinal judgments into distance constraints and optimizes a composite loss comprising a Relative Contrastive term with a dynamic margin network, a Dynamic Margin Constraint, and a fulfilled-relations penalty. The method is validated on two datasets, one on speech social attitudes and one on instrumental timbre. In both cases the learned latent space aligns with human judgments and enables prediction of perceptual structure for unseen samples. The approach offers a scalable, data-efficient framework for mapping complex perceptual spaces and provides insight into how human judgments organize audio attributes, with potential to enhance automated perceptual evaluation.
Abstract
This paper introduces BWSNet, a model that can be trained from raw human judgements obtained through a Best-Worst Scaling (BWS) experiment. It maps sound samples into an embedding space that represents the perception of a studied attribute. To this end, we propose a set of cost functions and constraints that interpret trial-wise ordinal relations as distance comparisons in a metric learning task. We tested our proposal on data from two BWS studies, one investigating the perception of speech social attitudes and the other timbral qualities. For both datasets, our results show that the structure of the latent space is faithful to human judgements.
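To make the idea of reading a BWS trial as distance comparisons more concrete, here is a minimal sketch. It is not the loss defined in the paper. It assumes one plausible reading: within a trial, the samples chosen as best and worst sit at opposite extremes, so their distance should exceed the distance from any other sample to either extreme. The function names (`trial_constraints`, `hinge_violation`), the fixed `margin`, and the toy one-dimensional embeddings are all hypothetical and chosen for illustration only.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def trial_constraints(items, best, worst):
    """Turn one BWS trial into distance comparisons.

    Assumption (not taken from the paper): for every other item x in the
    trial, d(best, x) and d(x, worst) should both be smaller than
    d(best, worst). Each constraint is ((a, b), (c, d)), read as
    "d(a, b) should be smaller than d(c, d)".
    """
    if best == worst or best not in items or worst not in items:
        raise ValueError("best and worst must be distinct items of the trial")
    others = [x for x in items if x not in (best, worst)]
    constraints = []
    for x in others:
        constraints.append(((best, x), (best, worst)))
        constraints.append(((x, worst), (best, worst)))
    return constraints


def hinge_violation(embeddings, constraints, margin=0.1):
    """Mean hinge penalty over constraints; 0.0 when all hold with margin.

    A fixed margin is used here for simplicity; it stands in for, and is
    not equivalent to, the dynamic margin described in the text.
    """
    if not constraints:
        return 0.0
    total = 0.0
    for (a, b), (c, d) in constraints:
        near = euclidean(embeddings[a], embeddings[b])
        far = euclidean(embeddings[c], embeddings[d])
        total += max(0.0, near - far + margin)
    return total / len(constraints)


# Toy trial: four samples, "s4" judged best and "s1" judged worst.
trial = ["s1", "s2", "s3", "s4"]
constraints = trial_constraints(trial, best="s4", worst="s1")

# Hypothetical 1-D embeddings that respect the trial's ordering.
consistent = {"s1": [0.0], "s2": [0.4], "s3": [0.6], "s4": [1.0]}
# Same embeddings, but "s2" is placed beyond the judged-best sample.
inconsistent = {"s1": [0.0], "s2": [2.0], "s3": [0.6], "s4": [1.0]}

loss_ok = hinge_violation(consistent, constraints)
loss_bad = hinge_violation(inconsistent, constraints)
```

In this sketch the consistent embedding incurs no penalty, while the inconsistent one is penalized. In a trainable model, a penalty of this kind would be minimized with respect to the parameters of the network that produces the embeddings.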
