Codec-SUPERB: An In-Depth Analysis of Sound Codec Models

Haibin Wu, Ho-Lam Chung, Yi-Cheng Lin, Yuan-Kuei Wu, Xuanjun Chen, Yu-Chi Pai, Hsiu-Hsuan Wang, Kai-Wei Chang, Alexander H. Liu, Hung-yi Lee

TL;DR

This paper analyzes codec models in depth from both application and signal perspectives, whereas previous codec papers concentrate mainly on signal-level comparisons.

Abstract

The sound codec's dual roles in minimizing data transmission latency and serving as a tokenizer underscore its critical importance. Recent years have witnessed significant developments in codec models. The ideal sound codec should preserve content, paralinguistics, speakers, and audio information. However, the question of which codec best preserves sound information remains unanswered, because different papers evaluate models under their own selected experimental settings. This study introduces Codec-SUPERB, an acronym for Codec sound processing Universal PERformance Benchmark. It is an ecosystem designed to assess codec models across representative sound applications and signal-level metrics rooted in sound domain knowledge. Codec-SUPERB simplifies result sharing through an online leaderboard, promoting collaboration within a community-driven benchmark database and thereby stimulating new development cycles for codecs. Furthermore, we undertake an in-depth analysis to offer insights into codec models from both application and signal perspectives, diverging from previous codec papers, which mainly concentrate on signal-level comparisons. Finally, we will release code, the leaderboard, and data to accelerate progress within the community.

Paper Structure (44 sections, 9 figures, 8 tables)

Figures (9)

  • Figure 1: Illustration of the Codec-SUPERB platform from two angles: developers and users. Developers build and evaluate new codec models across a spectrum of sound applications and signal-level metrics defined in our codebase. They then submit their prediction files to the online leaderboard to expand the benchmark database and facilitate comparisons with other codec models. Finally, they use the codebase's visualization and statistical tools to analyze performance discrepancies among Codec-SUPERB applications and metrics, gaining insights into directions for future improvement. Users can contribute datasets and metrics and pick codec models for their downstream applications.
  • Figure 2: The input sound is compressed by the codec encoder and resynthesized by the codec decoder. The resynthesized sound is then evaluated at both the signal level and the application level. Three dataset categories (speech, audio, and music) are evaluated using five signal-level metrics and one overall score. Four downstream applications are also evaluated.
  • Figure 3: Speech Overall Score vs bitrate.
  • Figure 4: Audio Overall Score vs bitrate.
  • Figure 5: Music Overall Score vs bitrate.
  • ...and 4 more figures
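
The encode, resynthesize, and score loop described in Figure 2, and the score-versus-bitrate comparison plotted in Figures 3 to 5, can be illustrated with a small sketch. Everything in it is an assumption made for illustration: `toy_encode` and `toy_decode` are a hypothetical uniform quantizer standing in for a real codec, and signal-to-noise ratio stands in for a signal-level metric (it is not claimed to be one of the benchmark's five metrics or its overall score). None of these names come from the Codec-SUPERB codebase.

```python
import math


def toy_encode(samples, bits):
    """Hypothetical stand-in for a codec encoder: uniform quantization
    of samples in [-1, 1] into integer codes of the given bit depth."""
    levels = 2 ** bits
    codes = []
    for x in samples:
        x = max(-1.0, min(1.0, x))  # clip to the valid range
        codes.append(min(levels - 1, int((x + 1.0) / 2.0 * levels)))
    return codes


def toy_decode(codes, bits):
    """Hypothetical stand-in for a codec decoder: map each code back
    to the midpoint of its quantization bin."""
    levels = 2 ** bits
    return [(c + 0.5) / levels * 2.0 - 1.0 for c in codes]


def snr_db(reference, estimate):
    """Illustrative signal-level metric: signal-to-noise ratio in dB
    between the original and the resynthesized waveform."""
    signal = sum(r * r for r in reference)
    noise = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if noise == 0:
        return math.inf
    return 10.0 * math.log10(signal / noise)


def bitrate_kbps(sample_rate, bits):
    """Bitrate of the toy codec: one code of `bits` bits per sample."""
    return sample_rate * bits / 1000.0


# One second of a 440 Hz tone as a placeholder input sound.
sample_rate = 16000
wave = [0.8 * math.sin(2 * math.pi * 440 * n / sample_rate)
        for n in range(sample_rate)]

# Sweep the bit depth to get (bitrate, score) pairs for comparison.
results = {}
for bits in (4, 8, 12):
    resynth = toy_decode(toy_encode(wave, bits), bits)
    results[bits] = (bitrate_kbps(sample_rate, bits), snr_db(wave, resynth))
    print(f"{results[bits][0]:7.1f} kbps  SNR = {results[bits][1]:6.2f} dB")
```

In this toy setting the score rises with bitrate. A benchmark would replace the quantizer with real codec models and add application-level evaluation on the resynthesized sound.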