Deepfake Word Detection by Next-token Prediction using Fine-tuned Whisper

Hoan My Tran; Xin Wang; Wanying Ge; Xuechen Liu; Junichi Yamagishi

TL;DR

This work investigates a cost-effective method that fine-tunes a pre-trained Whisper model to detect synthetic words while transcribing the input utterance via next-token prediction, and uses partially vocoded utterances as fine-tuning data to reduce the cost of data collection.

Abstract

Deepfake speech utterances can be forged by replacing one or more words in a bona fide utterance with semantically different words synthesized by speech generative models. While a dedicated synthetic word detector could be developed, we investigate a cost-effective method that fine-tunes a pre-trained Whisper model to detect synthetic words while transcribing the input utterance via next-token prediction. We further investigate using partially vocoded utterances as the fine-tuning data, thereby reducing the cost of data collection. Our experiments demonstrate that, on in-domain test data, the fine-tuned Whisper yields low synthetic-word detection error rates and transcription error rates. On out-of-domain test data with synthetic words produced by unseen speech generative models, the fine-tuned Whisper remains on par with a dedicated ResNet-based detection model; however, the overall performance degradation calls for strategies to improve its generalization capability.
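The idea of detecting synthetic words as part of transcription can be illustrated with a small text-level sketch. It uses the <TOF> and <EOF> marker tokens named in the Figure 1 caption; the exact target format is an assumption here (each synthetic word wrapped individually, words separated by single spaces), and the helper names `tag_transcript` and `parse_transcript` are hypothetical, not taken from the paper.

```python
import re

# Marker tokens from the Figure 1 caption; how they are spaced and whether
# they wrap single words or longer spans is an assumption of this sketch.
TOF, EOF = "<TOF>", "<EOF>"

# Matches either a marked span or a plain whitespace-delimited word.
_PATTERN = re.compile(r"<TOF>(.*?)<EOF>|(\S+)")


def tag_transcript(words, is_synthetic):
    """Build a fine-tuning target string: synthetic words are wrapped in markers."""
    if len(words) != len(is_synthetic):
        raise ValueError("words and is_synthetic must have the same length")
    return " ".join(
        f"{TOF}{word}{EOF}" if fake else word
        for word, fake in zip(words, is_synthetic)
    )


def parse_transcript(text):
    """Recover the plain words and per-word synthetic flags from decoded text."""
    words, flags = [], []
    for match in _PATTERN.finditer(text):
        if match.group(1) is not None:
            # Everything between the markers is treated as synthetic.
            for word in match.group(1).split():
                words.append(word)
                flags.append(True)
        else:
            words.append(match.group(2))
            flags.append(False)
    return words, flags


if __name__ == "__main__":
    target = tag_transcript(["the", "cat", "sat"], [False, True, False])
    print(target)
    print(parse_transcript(target))
```

Under these assumptions, the transcription is the parsed word sequence with markers removed, and the per-word flags are the synthetic-word decisions that a detection error rate would be scored against.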

Paper Structure (20 sections, 2 figures, 4 tables)

Introduction
Methods
Deepfake word detection via next-token prediction
Using vocoded data for fine-tuning Whisper
Experiments
Data and protocols
Models and training configurations
Evaluation metrics
Results on in-domain data
Matched data domain and synthetic methods
Matched data domain but different synthetic methods
Results on out-of-domain test data
Conclusion
Acknowledgment
Generative AI Use Disclosure
...and 5 more sections

Figures (2)

Figure 1: Illustration of a fine-tuned Whisper for speech-to-text transcription and synthetic word detection. <TOF> and <EOF> denote the tokens surrounding a synthetic word.
Figure 2: Analysis of synthetic word detection error rates based on word duration. Black and red profiles correspond to the fine-tuned Whisper using Ft.Voc and Ft.TTS, respectively.
