Naturalness-Aware Curriculum Learning with Dynamic Temperature for Speech Deepfake Detection
Taewoo Kim, Guisik Kim, Choongsang Cho, Young Han Lee
TL;DR
This work tackles robustness gaps in speech deepfake detection (SDD) by using speech naturalness as a perceptual cue, through a naturalness-aware curriculum learning framework and a dynamic temperature scaling mechanism. Predicted mean opinion scores (MOS) are used to rank sample difficulty, so the model learns progressively from easier to harder examples, while per-sample temperature scaling calibrates confidence according to perceptual difficulty. The approach reduces the equal error rate (EER) on the ASVspoof 2021 DF dataset by up to 23% in relative terms and generalizes well to unseen data, including the In-The-Wild corpus, without altering the backbone architecture. Ablation studies corroborate that both curriculum learning and dynamic temperature improve SDD performance and generalization across diverse scenarios.
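The per-sample temperature idea can be illustrated with a minimal sketch. This is not the authors' formulation: the linear mapping from difficulty to temperature, the range `t_min` to `t_max`, the direction (harder samples get a higher, softer temperature), and all function names are assumptions made only for illustration.

```python
import math

def dynamic_temperature(difficulty, t_min=1.0, t_max=2.0):
    """Map a difficulty score in [0, 1] to a temperature in [t_min, t_max].

    Assumption: a linear mapping where harder samples get a higher temperature.
    """
    d = min(max(difficulty, 0.0), 1.0)  # clamp to [0, 1]
    return t_min + (t_max - t_min) * d

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_cross_entropy(logits, label, difficulty):
    """Cross-entropy for one sample, using a difficulty-dependent temperature."""
    t = dynamic_temperature(difficulty)
    probs = softmax(logits, t)
    return -math.log(probs[label])

# Two-class logits ordered as [bona fide, spoof] (hypothetical ordering).
logits = [2.0, 0.0]
easy_loss = scaled_cross_entropy(logits, label=0, difficulty=0.0)
hard_loss = scaled_cross_entropy(logits, label=0, difficulty=1.0)
print(easy_loss, hard_loss)
```

Under these assumptions, the same logits yield a flatter probability distribution for a perceptually hard sample, so the model is discouraged from becoming overconfident on it.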
Abstract
Recent advances in speech deepfake detection (SDD) have significantly improved artifact-based detection of spoofed speech. However, most models overlook speech naturalness, a crucial cue for distinguishing bona fide speech from spoofed speech. This study proposes naturalness-aware curriculum learning, a novel training framework that leverages speech naturalness to enhance the robustness and generalization of SDD. The approach measures sample difficulty using both ground-truth labels and mean opinion scores (MOS), and adjusts the training schedule to progressively introduce more challenging samples. To further improve generalization, a dynamic temperature scaling method based on speech naturalness is incorporated into the training process. Experiments on the ASVspoof 2021 DF dataset achieved a 23% relative reduction in equal error rate (EER) without modifying the model architecture. Ablation studies confirmed the effectiveness of naturalness-aware training strategies for SDD tasks.
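A difficulty ranking built from labels and MOS, combined with a schedule that gradually admits harder samples, might look like the sketch below. The abstract does not give the actual difficulty formula or pacing function, so everything here is an assumption: the rule that natural-sounding spoofs and unnatural-sounding bona fide speech are hard, the 1 to 5 MOS range, the linear pacing, and the names `difficulty` and `curriculum_subset`.

```python
import math

def difficulty(label, mos, mos_min=1.0, mos_max=5.0):
    """Return a difficulty score in [0, 1].

    Assumptions: label 1 is bona fide and label 0 is spoofed; a spoof with
    high MOS is hard, and a bona fide sample with low MOS is hard.
    """
    n = (mos - mos_min) / (mos_max - mos_min)
    n = min(max(n, 0.0), 1.0)  # clamp normalized MOS to [0, 1]
    return 1.0 - n if label == 1 else n

def curriculum_subset(samples, epoch, total_epochs, start_fraction=0.3):
    """Return the easiest fraction of samples for this epoch.

    The fraction grows linearly (an assumed pacing function) from
    start_fraction at epoch 0 to the full set at the last epoch.
    """
    ranked = sorted(samples, key=lambda s: difficulty(s["label"], s["mos"]))
    progress = min(epoch / max(total_epochs - 1, 1), 1.0)
    fraction = start_fraction + (1.0 - start_fraction) * progress
    keep = max(1, math.ceil(fraction * len(ranked)))
    return ranked[:keep]

# Hypothetical samples; MOS values would come from a MOS predictor.
samples = [
    {"id": "a", "label": 1, "mos": 4.5},  # natural bona fide: easy
    {"id": "b", "label": 0, "mos": 1.5},  # unnatural spoof: easy
    {"id": "c", "label": 0, "mos": 4.0},  # natural spoof: hard
    {"id": "d", "label": 1, "mos": 2.0},  # unnatural bona fide: hard
]
for epoch in range(5):
    subset = curriculum_subset(samples, epoch, total_epochs=5, start_fraction=0.5)
    print(epoch, [s["id"] for s in subset])
```

The backbone model is untouched in this sketch; only the set of samples seen at each epoch changes, which matches the claim that no architectural modification is required.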
