Can LLMs Help Localize Fake Words in Partially Fake Speech?

Lin Zhang; Thomas Thebaud; Zexin Cai; Sanjeev Khudanpur; Daniel Povey; Leibny Paola García-Perera; Matthew Wiesner; Nicholas Andrews

Abstract

Large language models (LLMs), trained on large-scale text, have recently attracted significant attention for their strong performance across many tasks. Motivated by this, we investigate whether a text-trained LLM can help localize fake words in partially fake speech, where only specific words within an utterance are edited. We build a speech LLM to perform fake word localization via next-token prediction. Experiments and analyses on AV-Deepfake1M and PartialEdit indicate that the model frequently leverages editing-style patterns learned from the training data, particularly the word-level polarity substitutions present in these two databases, as cues for localizing fake words. Although such patterns provide useful information in an in-domain scenario, how to avoid over-reliance on them and improve generalization to unseen editing styles remains an open question.

Paper Structure (15 sections, 1 equation, 1 figure, 4 tables)

Introduction
From Alignment to Speech LLM
Alignment Baseline
Speech LLM for fake word localization
Experimental Setups
Databases
Experimental setups
Metrics
Results
Fake words localization with the Align model
Can LLMs help localize fake words in partially fake speech? -- Analyses with different modalities.
What patterns do LLMs exploit to localize fake words?
Conclusion
Acknowledgment
Generative AI Disclosure

Figures (1)

Figure 1: Fake-word localization via (a) Alignment between ASR and frame-level detector, and LLM-based approaches with three modality cases: (b) audio-only, (c) transcription-conditioned audio, and (d) transcription-only.
