Deepfake Word Detection by Next-token Prediction using Fine-tuned Whisper

Hoan My Tran, Xin Wang, Wanying Ge, Xuechen Liu, Junichi Yamagishi

TL;DR

This work investigates a cost-effective method that fine-tunes a pre-trained Whisper model to detect synthetic words while transcribing the input utterance via next-token prediction. It also explores partially vocoded utterances as fine-tuning data to reduce the cost of data collection.

Abstract

Deepfake speech utterances can be forged by replacing one or more words in a bona fide utterance with semantically different words synthesized by speech generative models. While a dedicated synthetic word detector could be developed, we investigate a cost-effective method that fine-tunes a pre-trained Whisper model to detect synthetic words while transcribing the input utterance via next-token prediction. We further investigate using partially vocoded utterances as the fine-tuning data, thereby reducing the cost of data collection. Our experiments demonstrate that, on in-domain test data, the fine-tuned Whisper yields low synthetic-word detection error rates and transcription error rates. On out-of-domain test data with synthetic words produced by unseen speech generative models, the fine-tuned Whisper remains on par with a dedicated ResNet-based detection model; however, the overall performance degradation calls for strategies to improve its generalization capability.

Paper Structure (20 sections, 2 figures, 4 tables)

Figures (2)

  • Figure 1: Illustration of a fine-tuned Whisper for speech-to-text transcription and synthetic word detection. <TOF> and <EOF> denote the tokens surrounding a synthetic word.
  • Figure 2: Analysis of synthetic word detection error rates based on word duration. Black and red profiles correspond to the fine-tuned Whisper using Ft.Voc and Ft.TTS, respectively.
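
The tagging scheme described in the Figure 1 caption, where <TOF> and <EOF> surround a synthetic word in the transcription, can be illustrated with a small sketch. The code below is not the authors' implementation: the function names `tag_transcript` and `parse_transcript`, the per-word wrapping without inner spaces, and whitespace tokenization are all assumptions made for illustration. It only shows how a tagged target string might be built for fine-tuning and how word-level labels might be read back from a decoded transcript.

```python
import re

# Marker strings taken from the Figure 1 caption; how they are registered
# in the Whisper tokenizer is not covered here.
TOF, EOF = "<TOF>", "<EOF>"


def tag_transcript(words, is_synthetic):
    """Build a target string in which each synthetic word is wrapped in TOF/EOF.

    Hypothetical helper: assumes one flag per word and no space between
    a marker and the word it wraps.
    """
    if len(words) != len(is_synthetic):
        raise ValueError("words and is_synthetic must have the same length")
    tagged = [f"{TOF}{w}{EOF}" if fake else w for w, fake in zip(words, is_synthetic)]
    return " ".join(tagged)


# Matches a single whitespace-delimited token of the form <TOF>word<EOF>.
_TAGGED = re.compile(re.escape(TOF) + r"(.*?)" + re.escape(EOF))


def parse_transcript(tagged):
    """Recover (word, is_synthetic) pairs from a tagged transcript string."""
    pairs = []
    for token in tagged.split():
        match = _TAGGED.fullmatch(token)
        if match:
            pairs.append((match.group(1), True))
        else:
            pairs.append((token, False))
    return pairs


target = tag_transcript(["i", "love", "cats"], [False, True, False])
labels = parse_transcript(target)
```

With this layout, a single decoded string carries both the transcription (the words with markers stripped) and the word-level synthetic/bona fide decision (whether a word was wrapped), which is the property the next-token prediction formulation relies on.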