A Reference-free Metric for Language-Queried Audio Source Separation using Contrastive Language-Audio Pretraining

Feiyang Xiao, Jian Guan, Qiaoxi Zhu, Xubo Liu, Wenbo Wang, Shuhan Qi, Kejia Zhang, Jianyuan Sun, Wenwu Wang

TL;DR

This paper tackles the challenge of evaluating language-queried audio source separation (LASS) without ground-truth references by introducing CLAPScore, a reference-free semantic similarity metric derived from the CLAP model. CLAPScore quantifies how well the separated audio aligns semantically with the text query, computed as the cosine similarity between the audio and text embeddings. It is extended with two variants: CLAPScore-i, which measures improvement, and RefCLAPScore, which supports evaluation when a reference is available. Experiments on the DCASE 2024 Task 9 validation set show that CLAPScore correlates with traditional SDR-based metrics and can differentiate separation quality in real-world, reference-free scenarios. The work provides a practical, semantics-aware evaluation tool for LASS with publicly available code, enabling more realistic benchmarking on multi-source audio data.
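As a rough illustration of the cosine-similarity idea described above, the following minimal sketch scores a separated clip against a text query, given embeddings that are assumed to have already been produced by a CLAP audio encoder and a CLAP text encoder. The function names (`cosine_similarity`, `clap_score`, `clap_score_improvement`) are hypothetical and are not taken from the authors' released code. The improvement variant is written here as the score of the separated audio minus the score of the input mixture, by analogy with SDR improvement; that form is an assumption, not a definition confirmed by this page.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def clap_score(audio_embedding: Sequence[float],
               text_embedding: Sequence[float]) -> float:
    """Reference-free score: similarity of separated audio to the text query.

    Both inputs are assumed to come from a pretrained CLAP model; no
    reference (ground-truth) audio is involved.
    """
    return cosine_similarity(audio_embedding, text_embedding)


def clap_score_improvement(separated_embedding: Sequence[float],
                           mixture_embedding: Sequence[float],
                           text_embedding: Sequence[float]) -> float:
    """Assumed form of an improvement score (hypothetical, by analogy with SDRi):
    how much closer the separated audio is to the query than the mixture was."""
    return (clap_score(separated_embedding, text_embedding)
            - clap_score(mixture_embedding, text_embedding))


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real CLAP outputs.
    text = [1.0, 0.0]
    separated = [2.0, 0.0]   # points in the same direction as the query
    mixture = [1.0, 1.0]     # only partly aligned with the query
    print(clap_score(separated, text))
    print(clap_score_improvement(separated, mixture, text))
```

In practice the embeddings would be high-dimensional vectors from a pretrained CLAP checkpoint; the toy vectors only show that a clip aligned with the query scores higher than a mixture that is only partly aligned.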

Abstract

Language-queried audio source separation (LASS) aims to separate an audio source guided by a text query, with the signal-to-distortion ratio (SDR)-based metrics being commonly used to objectively measure the quality of the separated audio. However, the SDR-based metrics require a reference signal, which is often difficult to obtain in real-world scenarios. In addition, with the SDR-based metrics, the content information of the text query is not considered effectively in LASS. This paper introduces a reference-free evaluation metric using a contrastive language-audio pretraining (CLAP) module, termed CLAPScore, which measures the semantic similarity between the separated audio and the text query. Unlike SDR, the proposed CLAPScore metric evaluates the quality of the separated audio based on the content information of the text query, without needing a reference signal. Experiments show that the CLAPScore provides an effective evaluation of the semantic relevance of the separated audio to the text query, as compared to the SDR metric, offering an alternative for the performance evaluation of LASS systems. The code for evaluation is publicly available.
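To make the dependence on a reference signal concrete, here is a minimal sketch of the plain SDR definition, 10·log10(‖s‖² / ‖s − ŝ‖²), where s is the reference and ŝ the estimate. The function name `sdr_db` is hypothetical, and the SDR-based metrics used in the paper may be variants of this basic form (for example, an improvement or scale-invariant version), so this should be read as an illustration rather than the exact evaluation code.

```python
import math
from typing import Sequence


def sdr_db(reference: Sequence[float], estimate: Sequence[float]) -> float:
    """Plain signal-to-distortion ratio in dB.

    Note that the reference waveform is a required argument: without the
    clean target source, this quantity cannot be computed at all.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if signal_energy == 0.0:
        raise ValueError("reference signal must be non-zero")
    if error_energy == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)


if __name__ == "__main__":
    # Toy waveforms: the estimate is the reference scaled by 0.9.
    reference = [1.0, 0.0, -1.0, 0.0]
    estimate = [0.9, 0.0, -0.9, 0.0]
    print(sdr_db(reference, estimate))
```

The text query never appears in this computation, which is the second limitation the abstract points out: SDR measures waveform fidelity to the reference, not whether the output matches what the query asked for.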

Paper Structure

This paper contains 15 sections, 8 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Illustration of the limitation of the SDR-based metrics for the evaluation of the language-queried audio source separation (LASS) methods in the real-world scenario, where the reference audio required by the SDR-based metrics is unavailable. Therefore, the SDR-based metrics are unusable for the evaluation of the LASS methods in the real-world scenario.
  • Figure 2: Illustration of the evaluation process with the proposed CLAPScore metric for language-queried audio source separation. Notably, the proposed CLAPScore metric does not need a reference audio for the evaluation. The inputs of the proposed CLAPScore metric, i.e., the estimated audio and the text query, are available in both simulation and real-world scenarios. Therefore, the CLAPScore metric is applicable to both scenarios.
  • Figure 3: Illustration of the correlation between the SDR metric and the proposed CLAPScore metric. Here, the separated audio comes from the LASS method AudioSep.
  • Figure 4: Illustration of the proposed CLAPScore metric for the mixtures from different mixing strategies.