Table of Contents
Fetching ...

An Empirical Analysis of Task-Induced Encoder Bias in Fréchet Audio Distance

Wonwoo Jeong

TL;DR

This work decomposes evaluation into Recall, Precision, and Alignment (split into semantic and structural dimensions), using log-scale normalization for fair cross-encoder comparison and reveals a four-axis trade-off in FAD.

Abstract

Fréchet Audio Distance (FAD) is the de facto standard for evaluating text-to-audio generation, yet its scores depend on the underlying encoder's embedding space. An encoder's training task dictates which acoustic features are preserved or discarded, causing FAD to inherit systematic task-induced biases. We decompose evaluation into Recall, Precision, and Alignment (split into semantic and structural dimensions), using log-scale normalization for fair cross-encoder comparison. Controlled experiments on six encoders across two datasets reveal a four-axis trade-off: reconstruction-based AudioMAE leads precision sensitivity; ASR-trained Whisper dominates structural detection but is blind to signal degradation; classification-trained VGGish maximizes semantic detection but penalizes legitimate intra-class variation. Since no single encoder is a universal evaluator, future metrics must shift toward evaluation-native encoders intrinsically aligned with human perception.

An Empirical Analysis of Task-Induced Encoder Bias in Fréchet Audio Distance

TL;DR

This work decomposes evaluation into Recall, Precision, and Alignment (split into semantic and structural dimensions), using log-scale normalization for fair cross-encoder comparison and reveals a four-axis trade-off in FAD.

Abstract

Fréchet Audio Distance (FAD) is the de facto standard for evaluating text-to-audio generation, yet its scores depend on the underlying encoder's embedding space. An encoder's training task dictates which acoustic features are preserved or discarded, causing FAD to inherit systematic task-induced biases. We decompose evaluation into Recall, Precision, and Alignment (split into semantic and structural dimensions), using log-scale normalization for fair cross-encoder comparison. Controlled experiments on six encoders across two datasets reveal a four-axis trade-off: reconstruction-based AudioMAE leads precision sensitivity; ASR-trained Whisper dominates structural detection but is blind to signal degradation; classification-trained VGGish maximizes semantic detection but penalizes legitimate intra-class variation. Since no single encoder is a universal evaluator, future metrics must shift toward evaluation-native encoders intrinsically aligned with human perception.
Paper Structure (16 sections, 3 equations, 4 figures, 2 tables)

This paper contains 16 sections, 3 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Four-axis trade-off (Outermost is better). AudioMAE leads Precision; Whisper captures Structural but is invariant to signal degradation; VGGish leads Semantic but limits Recall.
  • Figure 2: Precision response to white noise. (a) Raw FAD (linear scale): dynamic-range disparity compresses low-sensitivity encoders into a visually indistinguishable baseline---this "visual squashing" motivates our normalization. (b) After log-scale normalization ($S_{\mathrm{norm}}$), all six trajectories separate, revealing each encoder's distinct precision profile.
  • Figure 3: Pitch-shift trajectory ($-8$ to $+8$ st). Shaded: recall zone ($\pm$1--2 st). VGGish shows inflexible sensitivity at mild shifts; Whisper maintains the lowest recall-zone response.
  • Figure 4: Diverging bar chart of Structural (left) vs. Semantic (right) sensitivity. Solid bars: Reversal / Pitch +8 st; hatched bars: Shuffle 100 ms / Formant 1.4$\times$. Whisper extends far left (structural-dominant); VGGish extends far right (semantic-dominant)---visually capturing the anti-correlation ($r{=}{-}0.67$).