Towards Automatic Evaluation and High-Quality Pseudo-Parallel Dataset Construction for Audio Editing: A Human-in-the-Loop Method

Yuhang Jia, Hui Wang, Xin Nie, Yujie Guo, Lianru Gao, Yong Qin

TL;DR

This work tackles the data and evaluation bottlenecks in text-guided audio editing by introducing AuditScore, a large, human-annotated MOS-style dataset covering diverse editing frameworks and three perceptual dimensions. It then proposes AuditEval, a pair of automatic evaluators (SSL- and LLM-based) to predict human judgments and to filter synthetic pseudo-parallel data, enabling scalable data curation. Empirical results show that AuditEval-llm aligns better with human judgments on semantic aspects like Relevance and Faithfulness, while AuditEval-ssl captures acoustic Quality; filtering guided by these evaluators improves data quality and CLAP-based alignment, though objective metrics may not always track perceptual improvements. The study demonstrates the value of human-informed evaluation for both benchmarking and data construction, and it provides a practical, open-source toolkit for advancing high-quality audio editing research.

Abstract

Audio editing aims to manipulate audio content based on textual descriptions, supporting tasks such as adding, removing, or replacing audio events. Despite recent progress, the lack of high-quality benchmark datasets and comprehensive evaluation metrics remains a major challenge for both assessing audio editing quality and improving the task itself. In this work, we propose a novel approach to the audio editing task that incorporates expert knowledge into both the evaluation and dataset construction processes: 1) We establish AuditScore, the first comprehensive dataset for subjective evaluation of audio editing, consisting of over 6,300 edited samples generated from 7 representative audio editing frameworks and 23 system configurations. Each sample is annotated by professional raters on three key aspects of audio editing quality: overall Quality, Relevance to editing intent, and Faithfulness to original features. 2) Based on this dataset, we propose AuditEval, a family of automatic MOS-style evaluators tailored for audio editing, covering both SSL-based and LLM-based approaches. It addresses the lack of effective objective metrics and the prohibitive cost of subjective evaluation in this field. 3) We further leverage AuditEval to evaluate and filter a large number of synthetically mixed editing pairs, mining a high-quality pseudo-parallel subset by selecting the most plausible samples. Comprehensive experiments validate that our expert-informed filtering strategy yields higher-quality data, while also exposing the limitations of traditional objective metrics and the advantages of AuditEval. The dataset, code, and tools can be found at: https://github.com/NKU-HLT/AuditEval.
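The filtering step in point 3 can be illustrated with a minimal sketch. It is not the authors' implementation: the `ScoredPair` record, the `filter_pairs` helper, the 1-to-5 score range, the 3.5 threshold, and the mean-score ranking are all assumptions made here for illustration. The abstract states only that predicted scores are used to select the most plausible samples.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScoredPair:
    """One synthetic editing pair with predicted scores (hypothetical schema)."""
    pair_id: str
    quality: float       # predicted overall Quality (assumed 1-5 MOS-style range)
    relevance: float     # predicted Relevance to the editing intent
    faithfulness: float  # predicted Faithfulness to the original audio


def mean_score(pair: ScoredPair) -> float:
    """Average of the three predicted dimensions."""
    return (pair.quality + pair.relevance + pair.faithfulness) / 3.0


def filter_pairs(
    pairs: List[ScoredPair],
    min_score: float = 3.5,       # assumed threshold, not taken from the paper
    top_k: Optional[int] = None,  # optionally cap the size of the kept subset
) -> List[ScoredPair]:
    """Keep pairs whose every dimension clears min_score, best mean first."""
    passed = [
        p for p in pairs
        if min(p.quality, p.relevance, p.faithfulness) >= min_score
    ]
    # Rank the surviving pairs by mean predicted score, highest first.
    passed.sort(key=mean_score, reverse=True)
    return passed if top_k is None else passed[:top_k]


if __name__ == "__main__":
    candidates = [
        ScoredPair("a", 4.5, 4.0, 4.2),
        ScoredPair("b", 4.8, 2.0, 4.5),  # fails on relevance
        ScoredPair("c", 3.6, 3.7, 3.9),
    ]
    for pair in filter_pairs(candidates):
        print(pair.pair_id, round(mean_score(pair), 2))
```

Requiring every dimension to clear the threshold, rather than only the mean, stops a pair with high acoustic quality but a failed edit from slipping through. Other selection rules, such as a per-operation quota, would fit the same interface.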

Paper Structure

This paper contains 29 sections, 4 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Distributions of AuditScore annotations (left) and AuditEval predictions (right) at both the system and utterance levels.
  • Figure 2: Model designs of AuditEval-llm and AuditEval-ssl, along with the corresponding training modules and inference workflows.
  • Figure 3: Score prediction distributions on the large-scale Pseudo-Parallel Audio Editing Dataset, with addition, deletion, and modification operations shown from top to bottom.
  • Figure 4: Score prediction distributions on the large-scale Pseudo-Parallel Audio Editing Dataset, with addition, deletion, and modification operations shown from top to bottom.
  • Figure 5: User Interface for Subjective Evaluation.
  • ...and 2 more figures