VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music

Jiatong Shi, Hye-jin Shim, Jinchuan Tian, Siddhant Arora, Haibin Wu, Darius Petermann, Jia Qi Yip, You Zhang, Yuxun Tang, Wangyou Zhang, Dareen Safar Alharthi, Yichen Huang, Koichi Saito, Jionghao Han, Yiwen Zhao, Chris Donahue, Shinji Watanabe

TL;DR

VERSA addresses the need for standardized evaluation in speech, audio, and music by providing 65 metrics with 729 variants within a Python-based toolkit. It unifies diverse evaluation paradigms—ranging from matching and non-matching references to textual and multimodal cues—under a YAML-configured interface, while implementing strict dependency controls and resource caching to ensure reproducibility. Through demonstrations across codec evaluation, TTS, speech enhancement, singing synthesis, and music generation, VERSA showcases cross-domain benchmarking capabilities and promotes fair, comprehensive assessment of generative audio systems. As an open-source, community-driven platform, VERSA aims to accelerate robust benchmarking and reproducible comparisons in AI-generated audio and music technologies.

Abstract

In this work, we introduce VERSA, a unified and standardized evaluation toolkit designed for various speech, audio, and music signals. The toolkit features a Pythonic interface with flexible configuration and dependency control, making it user-friendly and efficient. With full installation, VERSA offers 65 metrics with 729 metric variations based on different configurations. These metrics encompass evaluations utilizing diverse external resources, including matching and non-matching reference audio, text transcriptions, and text captions. As a lightweight yet comprehensive toolkit, VERSA supports the evaluation of a wide range of downstream scenarios. To demonstrate its capabilities, this work highlights example use cases for VERSA, including audio coding, speech synthesis, speech enhancement, singing synthesis, and music generation. The toolkit is available at https://github.com/wavlab-speech/versa.

Paper Structure

This paper contains 22 sections, 2 figures, 11 tables.

Figures (2)

  • Figure 1: Using various external resources for automatic sound evaluation. External resources include any matching reference signals, non-matching reference signals, transcriptions, visual cues, or textual captions.
  • Figure 2: Directory structure of VERSA. A detailed discussion can be found in the section on the VERSA framework.