KAD: No More FAD! An Effective and Efficient Evaluation Metric for Audio Generation

Yoonjin Chung, Pilsun Eu, Junwon Lee, Keunwoo Choi, Juhan Nam, Ben Sangbae Chon

TL;DR

The paper introduces the Kernel Audio Distance (KAD), a novel, distribution-free, unbiased, and computationally efficient metric based on Maximum Mean Discrepancy (MMD) that offers a reliable and perceptually aligned benchmark for evaluating generative audio models.

Abstract

Although widely adopted for evaluating generated audio signals, the Fréchet Audio Distance (FAD) suffers from significant limitations, including reliance on Gaussian assumptions, sensitivity to sample size, and high computational complexity. As an alternative, we introduce the Kernel Audio Distance (KAD), a novel, distribution-free, unbiased, and computationally efficient metric based on Maximum Mean Discrepancy (MMD). Through analysis and empirical validation, we demonstrate KAD's advantages: (1) faster convergence with smaller sample sizes, enabling reliable evaluation with limited data; (2) lower computational cost, with scalable GPU acceleration; and (3) stronger alignment with human perceptual judgments. By leveraging advanced embeddings and characteristic kernels, KAD captures nuanced differences between real and generated audio. Open-sourced in the kadtk toolkit, KAD provides an efficient, reliable, and perceptually aligned benchmark for evaluating generative audio models.
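
To make the MMD idea concrete, here is a minimal sketch of the standard unbiased estimator of squared MMD between two sets of embedding vectors. The Gaussian kernel, the fixed bandwidth of 2.0, the toy 4-dimensional data, and all function names are assumptions chosen for illustration; this is not the kadtk implementation, and real audio embeddings would replace the random vectors.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def gaussian_kernel(a, b, bandwidth):
    """Gaussian RBF kernel; an assumed choice of characteristic kernel."""
    return math.exp(-sq_dist(a, b) / (2.0 * bandwidth ** 2))


def mmd2_unbiased(xs, ys, bandwidth):
    """Unbiased estimate of squared MMD between sample sets xs and ys.

    The within-set sums skip the diagonal (i == j). That is what removes
    the bias, and it also means the estimate can be slightly negative.
    """
    m, n = len(xs), len(ys)
    if m < 2 or n < 2:
        raise ValueError("need at least two samples per set")
    k_xx = sum(gaussian_kernel(xs[i], xs[j], bandwidth)
               for i in range(m) for j in range(m) if i != j)
    k_yy = sum(gaussian_kernel(ys[i], ys[j], bandwidth)
               for i in range(n) for j in range(n) if i != j)
    k_xy = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return k_xx / (m * (m - 1)) + k_yy / (n * (n - 1)) - 2.0 * k_xy / (m * n)


# Toy stand-ins for embeddings (hypothetical data, fixed seed).
rng = random.Random(0)


def sample(mean, count=100, dim=4):
    """Draw `count` vectors with independent N(mean, 1) coordinates."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(count)]


reference = sample(0.0)
same_score = mmd2_unbiased(reference, sample(0.0), bandwidth=2.0)
shifted_score = mmd2_unbiased(reference, sample(2.0), bandwidth=2.0)
print(same_score, shifted_score)
```

In this sketch, two sets drawn from the same distribution should score close to zero, while the shifted set should score clearly higher. No mean or covariance is fitted at any point, which is the sense in which an MMD-based metric avoids the Gaussian assumption that FAD relies on.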

Paper Structure

This paper contains 15 sections, 6 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Comparison between KAD (Kernel Audio Distance) and FAD (Fréchet Audio Distance). KAD is a distribution-free metric that does not require any underlying assumptions for embedding distributions $P$ and $Q$.
  • Figure 3: Spearman correlations between metric scores and human perceptual ratings for different embedding models. Because lower scores are better for both metrics, the raw correlations are negative; they are multiplied by -1 for ease of visualization. KAD (orange) consistently achieves higher alignment than FAD (blue).
  • Figure 4: Normalized FAD and KAD scores against increasing embedding sample size. Scores are normalized by their respective extrapolated values at $N=\infty$. The shaded regions indicate standard deviations.
  • Figure 5: Comparison of FAD and KAD wall-clock computation times. (a) $N=1000$ with varying $d$. (b) $d=2048$ with varying $N$. Solid lines indicate CPU usage and dotted lines indicate GPU usage. Error bars mark the 5th to 95th percentile of 200 trials.
  • Figure 6: Effect of audio degradations on the MMD values, with the maximum value normalized to 1 for each bandwidth. MMD values smaller than $\varepsilon=10^{-12}$ are clipped to $\varepsilon$ before normalization.
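
The clipping and normalization described in the Figure 6 caption can be sketched as follows. The function name and the example values are hypothetical; only the threshold $\varepsilon=10^{-12}$ and the order of operations (clip first, then scale by the maximum) come from the caption.

```python
def normalize_mmd_curve(values, eps=1e-12):
    """Clip values below eps up to eps, then scale so the maximum is 1."""
    clipped = [max(v, eps) for v in values]
    peak = max(clipped)
    return [v / peak for v in clipped]


# Hypothetical MMD values for one bandwidth across degradation levels.
curve = normalize_mmd_curve([0.0, 0.5, 2.0])
print(curve)
```

Clipping before normalization keeps every value strictly positive, which matters if the curves are drawn on a logarithmic axis (an assumption about the plot, not stated in the caption).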