PAM: Prompting Audio-Language Models for Audio Quality Assessment

Soham Deshmukh, Dareen Alharthi, Benjamin Elizalde, Hannes Gamper, Mahmoud Al Ismail, Rita Singh, Bhiksha Raj, Huaming Wang

TL;DR

This work extensively evaluates the reliability of PAM against established metrics and human listening scores on four tasks: text-to-audio (TTA), text-to-music generation (TTM), text-to-speech (TTS), and deep noise suppression (DNS).

Abstract

While audio quality is a key performance metric for various audio processing tasks, including generative modeling, its objective measurement remains a challenge. Audio-Language Models (ALMs) are pre-trained on audio-text pairs that may contain information about audio quality, the presence of artifacts, or noise. Given an audio input and a text prompt related to quality, an ALM can be used to calculate a similarity score between the two. Here, we exploit this capability and introduce PAM, a no-reference metric for assessing audio quality for different audio processing tasks. Unlike other "reference-free" metrics, PAM requires neither computing embeddings on a reference dataset nor training a task-specific model on a costly set of human listening scores. We extensively evaluate the reliability of PAM against established metrics and human listening scores on four tasks: text-to-audio (TTA), text-to-music generation (TTM), text-to-speech (TTS), and deep noise suppression (DNS). We perform multiple ablation studies with controlled distortions, in-the-wild setups, and prompt choices. Our evaluation shows that PAM correlates well with existing metrics and human listening scores. These results demonstrate the potential of ALMs for computing a general-purpose audio quality metric.
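
The prompting idea in the abstract can be illustrated with a minimal sketch. It assumes the ALM has already produced one embedding for the audio and one embedding for each of two opposing quality prompts (for example, a "clean" prompt and a "noisy" prompt; the exact wording is hypothetical here). The two cosine similarities are turned into a single score with a softmax, which is one plausible way to combine them and is not confirmed by this page as the authors' exact formulation. The function names, the `scale` parameter, and the toy vectors are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def two_prompt_score(audio_emb, good_emb, bad_emb, scale=1.0):
    """Softmax over the similarities to two opposing prompts.

    Returns the probability mass assigned to the "good quality" prompt,
    a value in (0, 1). `scale` is a hypothetical temperature-like factor.
    """
    s_good = scale * cosine(audio_emb, good_emb)
    s_bad = scale * cosine(audio_emb, bad_emb)
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(s_good, s_bad)
    e_good = math.exp(s_good - m)
    e_bad = math.exp(s_bad - m)
    return e_good / (e_good + e_bad)


# Toy embeddings standing in for real ALM outputs (assumption: a real
# system would obtain these from the model's audio and text encoders).
audio = [0.9, 0.1, 0.2]
good = [1.0, 0.0, 0.1]   # embedding of a hypothetical "clean" prompt
bad = [0.0, 1.0, 0.1]    # embedding of a hypothetical "noisy" prompt

score = two_prompt_score(audio, good, bad)
print(f"two-prompt score: {score:.3f}")
```

Because the score is normalized across the two prompts, it stays in (0, 1) and equals 0.5 when the audio is equally similar to both. A single raw similarity has no such anchor.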

Paper Structure

This paper contains 28 sections, 3 equations, 6 figures, 12 tables.

Figures (6)

  • Figure 1: Effect on PAM of adding different distortions: (a) Gaussian noise, (b) Gaussian noise at a given SNR, (c) tanh distortion, (d) mu-law compression, (e) reverb. The audio comes from a professionally recorded dataset of sounds. PAM decreases as the distortion of the signal increases.
  • Figure 2: Pearson correlation coefficient (PCC) plots between MOS (human subjective evaluations) and two prompt strategies for four distortions applied to the NISQA dataset. The top row uses the naive single-prompt strategy and the bottom row uses PAM (the two-opposing-prompts strategy). Both are described in Fig. \ref{fig:distortions_main}. Using one prompt for AQA does not correlate with MOS, but using two prompts (PAM) does.
  • Figure 3: Absolute PCC between metrics and the human scores OVL and REL for different TTA generation models. The input text captions come from the AudioCaps dataset. PAM, as a single metric, correlates well with both overall quality (OVL) and relevance to the input caption (REL).
  • Figure 4: PCC between listening test MOS and Fréchet Audio Distance (FAD) and PAM, for acoustic (AQ, left) and musical quality (MQ, right).
  • Figure 5: Absolute PCC between the human evaluation and metrics for TTS models. The transcripts are sourced from the LibriTTS dataset. On average, PAM correlates better with human perception of speech quality than existing metrics.
  • ...and 1 more figure
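
The figure captions above report agreement with human scores as PCC or absolute PCC. As a minimal sketch of that computation, the snippet below implements the Pearson correlation coefficient in plain Python. The MOS and metric values are invented for illustration only and are not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("PCC is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-clip values, made up for this example.
mos = [1.5, 2.0, 3.2, 4.1, 4.6]             # human listening scores
metric = [0.21, 0.30, 0.55, 0.70, 0.82]     # an automatic quality metric

r = pearson(mos, metric)

# Absolute PCC lets "higher is better" and "lower is better" metrics
# be compared on the same axis: the sign is discarded.
abs_r = abs(pearson(mos, [-m for m in metric]))

print(f"PCC = {r:.3f}, |PCC| of negated metric = {abs_r:.3f}")
```

Taking the absolute value matters when a distance-style metric, where lower is better, is compared against a score-style metric, where higher is better.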