Semi-intrusive audio evaluation: Casting non-intrusive assessment as a multi-modal text prediction task

Jozef Coldenhoff; Milos Cernak

TL;DR

This work introduces semi-intrusive audio assessment, which reframes quality evaluation as a text-prediction task over audio-text inputs to emulate human selective listening without a clean reference signal. The authors fine-tune the multi-modal PENGI model for MOS estimation and for a novel SNR estimator that operates on mixtures, using data simulation to overcome label scarcity. The approach yields competitive MOS results on non-speech and speech/music data and, according to the authors, delivers the first end-to-end SNR estimator that can focus on a specific source in a mixture. The results point toward flexible, text-conditioned, and more explainable assessment of diverse audio signals.

Abstract

Human perception has the unique ability to focus on specific events in a mixture of signals, a challenging task for existing non-intrusive assessment methods. In this work, we introduce semi-intrusive assessment that emulates human attention by framing audio assessment as a text-prediction task with audio-text inputs. To this end, we extend the multi-modal PENGI model through instruction fine-tuning for mean opinion score (MOS) and signal-to-noise ratio (SNR) estimation. For MOS, our approach achieves absolute Pearson correlation gains of 0.06 and 0.20 over the re-trained MOSRA model and the pre-trained PAM model, respectively. We further propose a novel SNR estimator that can focus on a specific audio source in a mixture, outperforming a random baseline and the fixed-prompt counterpart. Our findings suggest that semi-intrusive assessment can effectively capture human-like selective listening capabilities. Samples are available at https://jozefcoldenhoff.github.io/semi-intrusive-assessment.
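To make the idea of source-specific SNR estimation as text prediction more concrete, the following is a minimal sketch of how one simulated training example could be built: an interfering signal is scaled so that the mixture has a chosen SNR with respect to the target source, and the SNR is then written out as a text label. The prompt wording, the 1 dB rounding, and all function names are assumptions made for illustration; the paper's actual simulation and label quantization may differ.

```python
import numpy as np


def mix_at_snr(target, interferer, snr_db):
    """Scale `interferer` so the target-to-interferer power ratio equals snr_db."""
    p_target = np.mean(target ** 2)
    p_interf = np.mean(interferer ** 2)
    # Solve 10*log10(p_target / (gain**2 * p_interf)) == snr_db for gain.
    gain = np.sqrt(p_target / (p_interf * 10 ** (snr_db / 10)))
    scaled = gain * interferer
    return target + scaled, scaled


def measured_snr_db(target, noise):
    """SNR in dB of `target` relative to `noise`."""
    return 10 * np.log10(np.mean(target ** 2) / np.mean(noise ** 2))


def make_example(target, interferer, snr_db, source_name):
    """Return (mixture, text prompt, text label) for one simulated example."""
    mixture, scaled = mix_at_snr(target, interferer, snr_db)
    # Hypothetical prompt wording; it names the source the model should attend to.
    prompt = f"what is the snr of the {source_name}?"
    # Assumed quantization: round to the nearest integer dB and emit as text.
    label = str(int(round(measured_snr_db(target, scaled))))
    return mixture, prompt, label


rng = np.random.default_rng(0)
sr = 16000
target = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s tone as a stand-in source
interferer = rng.standard_normal(sr)                   # white noise as the interferer
mixture, prompt, label = make_example(target, interferer, 5.0, "tone")
```

Because the label is tied to a named source, swapping the roles of `target` and `interferer` yields a different prompt and the negated SNR for the same mixture, which is the property a fixed-prompt estimator cannot express.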

Paper Structure (22 sections, 3 equations, 1 figure, 6 tables)

Introduction
Methods
Model architecture
Fine-tuning
Prompting and labelling strategy
Data simulation
Quality estimation
SNR estimation
Experiments
Training details
Label quantization strategy
Evaluation
Datasets
Quality estimation
SNR Estimation
...and 7 more sections

Figures (1)

Figure 1: Overview of the PENGI architecture adapted for semi-intrusive audio assessment. The encoders process the audio and text inputs, and together with the mapping networks $m_1$ and $m_2$, they generate the soft prompt for the GPT2-base language model. Components marked red are fine-tuned during training, while those marked blue have frozen parameters. The masked labels are fed as input to the model during training to enable training with teacher forcing.
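The teacher-forcing setup mentioned in the caption can be sketched generically as follows. This is not the authors' code: it only illustrates the common pattern in which a causal language model receives a soft prompt followed by the ground-truth label tokens, and the loss is computed only at positions that predict label tokens. The token ids, the `IGNORE_INDEX` value, and the function names are hypothetical.

```python
IGNORE_INDEX = -100  # assumed marker for positions excluded from the loss


def build_targets(prefix_len, label_ids, eos_id):
    """Build per-position targets for a sequence of soft prompt + label tokens.

    The model input has prefix_len soft-prompt positions followed by the label
    tokens. The output at position i is trained to predict the token at i + 1,
    so the last prefix position predicts the first label token and the last
    label position predicts the end-of-sequence token.
    """
    if prefix_len < 1:
        raise ValueError("prefix_len must be at least 1")
    label_inputs = list(label_ids)  # ground-truth tokens fed back as input
    targets = [IGNORE_INDEX] * (prefix_len - 1) + list(label_ids) + [eos_id]
    return label_inputs, targets


def supervised_positions(targets):
    """Indices of the positions that contribute to the loss."""
    return [i for i, t in enumerate(targets) if t != IGNORE_INDEX]


# Example: a 4-position soft prompt and a 2-token label such as "3" "." (made-up ids).
label_inputs, targets = build_targets(prefix_len=4, label_ids=[11, 12], eos_id=0)
```

With a causal attention mask, each supervised position only sees the soft prompt and the label tokens before it, so feeding the true labels as input does not leak the token being predicted.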
