Metric Analysis for Spatial Semantic Segmentation of Sound Scenes
Mayank Mishra, Paul Magron, Romain Serizel
TL;DR
The paper addresses the evaluation of spatial semantic segmentation of sound scenes (S5), which requires judging separation and classification jointly. It identifies limitations of the class-aware SDR (CA-SDR) and introduces CASA-SDR, a permutation-aware SDR that first finds the source pairing that maximizes similarity and only then applies labels, thereby disentangling separation errors from labeling errors. To better reflect real-world misclassifications, it proposes input- and output-level penalties (IP and OP) with two application modes (non-TP-based and EB), which let the emphasis be tuned between labeling and separation errors. On a synthetic S5 dataset with controlled error scenarios, CASA-SDR provides clearer diagnostics than CA-SDR and classical SDR, highlighting the value of permutation-aware evaluation and penalty-informed metrics for joint S5 assessment.
Abstract
Spatial semantic segmentation of sound scenes (S5) consists of jointly performing audio source separation and sound event classification from a multichannel audio mixture. To evaluate S5 systems, one can consider two individual metrics, i.e., one for source separation and another for sound event classification, but this approach makes it challenging to compare S5 systems. Thus, a joint class-aware signal-to-distortion ratio (CA-SDR) metric was proposed to evaluate S5 systems. In this work, we first compare the CA-SDR with the classical SDR on scenarios with only classification errors. We then analyze the cases where the metric might not allow proper comparison of the systems. To address this problem, we propose a modified version of the CA-SDR which first focuses on class-agnostic SDR and then accounts for the wrongly labeled sources. We also analyze the performance of the two metrics under cross-contamination between separated audio sources. Finally, we propose a first set of penalties in an attempt to make the metric more reflective of the labeling and separation errors.
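The idea of scoring separation in a class-agnostic way first and only then accounting for wrong labels can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`sdr_db`, `best_pairing`, `label_aware_score`), the single-channel signals, the equal number of references and estimates, the exhaustive permutation search, and the rule that a mislabeled source simply receives a fixed `penalty` value are all assumptions made for illustration.

```python
import itertools
import numpy as np


def sdr_db(ref, est, eps=1e-12):
    """Classical SDR in dB: reference energy over error energy."""
    num = np.sum(ref ** 2)
    den = np.sum((ref - est) ** 2)
    return 10.0 * np.log10((num + eps) / (den + eps))


def best_pairing(refs, ests):
    """Try every pairing and keep the one with the highest mean SDR.

    perm[i] is the index of the estimate paired with reference i.
    Assumes len(refs) == len(ests) (an illustrative simplification).
    """
    best_perm, best_sdrs = None, None
    for perm in itertools.permutations(range(len(refs))):
        sdrs = np.array([sdr_db(refs[i], ests[j]) for i, j in enumerate(perm)])
        if best_sdrs is None or sdrs.mean() > best_sdrs.mean():
            best_perm, best_sdrs = perm, sdrs
    return best_perm, best_sdrs


def label_aware_score(refs, ests, ref_labels, est_labels, penalty=0.0):
    """Class-agnostic pairing first, label check second.

    Hypothetical scoring rule: a correctly labeled pair keeps its SDR,
    a mislabeled pair is replaced by `penalty` (in dB).
    """
    perm, sdrs = best_pairing(refs, ests)
    correct = np.array(
        [ref_labels[i] == est_labels[j] for i, j in enumerate(perm)]
    )
    score = float(np.where(correct, sdrs, penalty).mean())
    return score, perm, correct
```

Because the pairing is chosen from the signals alone, a well-separated source with a wrong label remains identifiable as a labeling error rather than being scored against the wrong reference. The exhaustive search grows factorially with the number of sources, so it is only practical for the small source counts typical of such mixtures.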
