BWSNet: Automatic Perceptual Assessment of Audio Signals

Clément Le Moine Veillon, Victor Rosi, Pablo Arias Sarah, Léane Salais, Nicolas Obin

TL;DR

This work tackles automatic perceptual assessment from Best-Worst Scaling (BWS) judgments by introducing BWSNet, a metric-learning model that maps audio to a latent space in which distances reflect perceptual similarity with respect to a studied attribute. It translates trial-wise ordinal judgments into distance constraints and optimizes a composite loss that includes a Relative Contrastive term with a dynamic margin network, a Dynamic Margin Constraint, and a fulfilled-relations penalty. The method is validated on two datasets (speech social attitudes and instrumental timbre), yielding latent spaces that align with human judgments and enabling prediction of perceptual structure for unseen samples. The approach offers a framework for mapping complex perceptual spaces directly from raw BWS data and may give insight into how human judgments organize audio attributes.

Abstract

This paper introduces BWSNet, a model that can be trained from raw human judgements obtained through a Best-Worst Scaling (BWS) experiment. It maps sound samples into an embedding space that represents the perception of a studied attribute. To this end, we propose a set of cost functions and constraints that interpret trial-wise ordinal relations as distance comparisons in a metric learning task. We tested our proposal on data from two BWS studies investigating the perception of speech social attitudes and timbral qualities. For both datasets, our results show that the structure of the latent space is faithful to human judgements.
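To make the idea of "ordinal relations as distance comparisons" concrete, here is a minimal sketch of one plausible reading, not the paper's actual loss. It assumes that within a trial the best and worst items should be the farthest pair in the embedding space, and it uses a fixed margin in place of the dynamic margin network mentioned in the TL;DR. The function names `euclidean` and `trial_hinge_loss`, the `margin` value, and the choice of Euclidean distance are all illustrative assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def trial_hinge_loss(embeddings, best, worst, margin=0.1):
    """Hinge penalty for one BWS trial (illustrative assumption only).

    embeddings: dict mapping item name -> embedding vector for the
                N sounds presented in the trial.
    best, worst: names of the items judged best and worst.

    Assumed constraint: d(best, worst) should exceed both d(best, other)
    and d(other, worst) by at least `margin` for every other item.
    """
    d_bw = euclidean(embeddings[best], embeddings[worst])
    loss = 0.0
    for name, z in embeddings.items():
        if name in (best, worst):
            continue
        # Penalise only the comparisons that are violated.
        loss += max(0.0, euclidean(embeddings[best], z) - d_bw + margin)
        loss += max(0.0, euclidean(z, embeddings[worst]) - d_bw + margin)
    return loss
```

With embeddings laid out so that the best and worst items sit at opposite ends, every comparison is satisfied and the penalty is zero. Moving another item beyond the worst item produces a positive penalty that a gradient-based optimizer could reduce.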

Paper Structure

This paper contains 18 sections, 6 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: A BWS trial $t^{a} \in \mathcal{T}^{a}$ of $N=4$ sounds judged with respect to the attribute $a$ (left) and the derived relations (right).
  • Figure 2: The Mel-spectrograms of the $N=4$ samples in a BWS trial $t^{a}$ investigating the attribute $a$ are passed to BWSNet. The model yields BWS embeddings whose relative positions change over training.
  • Figure 3: UMAP visualization of BWSNet's latent space for social attitudes (left) and timbral qualities (right). Each point is a sample whose BWS score is represented by its colour (left) or its size (right).
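
The captions above refer to relations derived from a trial (Figure 1) and to per-sample BWS scores (Figure 3). The sketch below shows one way these could be computed, under two assumptions. First, a trial implies only that the best item ranks above every other item and that every other item ranks above the worst, leaving the remaining items unordered relative to each other. Second, the BWS score is the standard counting score, (times chosen best minus times chosen worst) divided by number of appearances; the source does not say which scoring the paper uses. The names `derive_relations` and `bws_scores` are hypothetical.

```python
from collections import Counter


def derive_relations(items, best, worst):
    """Return (higher, lower) pairs implied by a single BWS trial.

    Assumption: best > every other item > worst; the remaining items
    are not ordered relative to each other.
    """
    others = [x for x in items if x not in (best, worst)]
    relations = [(best, worst)]
    relations += [(best, o) for o in others]
    relations += [(o, worst) for o in others]
    return relations


def bws_scores(trials):
    """Counting score per item: (#best - #worst) / #appearances, in [-1, 1].

    trials: iterable of (items, best, worst) tuples.
    """
    best_n, worst_n, seen = Counter(), Counter(), Counter()
    for items, best, worst in trials:
        seen.update(items)
        best_n[best] += 1
        worst_n[worst] += 1
    return {x: (best_n[x] - worst_n[x]) / seen[x] for x in seen}
```

For a trial of $N=4$ sounds this yields five relations out of the six possible pairs; the pair formed by the two unselected items stays undetermined.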