Leveraging large multimodal models for audio-video deepfake detection: a pilot study

Songjun Cao; Yuqi Li; Yunpeng Luo; Jianjun Yin; Long Ma

TL;DR

AV-LMMDetect is a supervised fine-tuned (SFT) large multimodal model that casts audio-visual deepfake detection (AVD) as a prompted yes/no classification: "Is this video real or fake?"

Abstract

Audio-visual deepfake detection (AVD) is increasingly important as modern generators can fabricate convincing speech and video. Most current multimodal detectors are small, task-specific models: they work well on curated tests but scale poorly and generalize weakly across domains. We introduce AV-LMMDetect, a supervised fine-tuned (SFT) large multimodal model that casts AVD as a prompted yes/no classification: "Is this video real or fake?" Built on Qwen 2.5 Omni, it jointly analyzes the audio and visual streams and is trained in two stages: lightweight LoRA alignment, followed by full fine-tuning of the audio and visual encoders. On FakeAVCeleb and MAVOS-DD, AV-LMMDetect matches or surpasses prior methods, and it sets a new state of the art on MAVOS-DD.
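The abstract does not spell out how the prompted answer becomes a detection score. One common way to do this, shown here only as a minimal sketch and not as the authors' procedure, is to compare the model's logits for the two candidate answer tokens and normalize them with a softmax. The function names, the `"real"`/`"fake"` keys, and the 0.5 threshold are all assumptions made for illustration.

```python
import math


def fake_probability(answer_logits: dict[str, float]) -> float:
    """Two-way softmax over hypothetical answer-token logits.

    `answer_logits` is assumed to hold the logits the model assigns to the
    tokens "real" and "fake" at the answer position. Returns P("fake").
    """
    real, fake = answer_logits["real"], answer_logits["fake"]
    # Subtract the max before exponentiating for numerical stability.
    m = max(real, fake)
    e_real = math.exp(real - m)
    e_fake = math.exp(fake - m)
    return e_fake / (e_real + e_fake)


def classify(answer_logits: dict[str, float], threshold: float = 0.5) -> str:
    """Map the score to a label; the threshold is an illustrative default."""
    return "fake" if fake_probability(answer_logits) >= threshold else "real"
```

A continuous score of this kind can feed threshold-free metrics such as AUC, while the thresholded label gives accuracy; which metrics the paper actually reports is not stated in the abstract.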

Paper Structure

This paper contains 12 sections, 1 equation, 3 figures, 3 tables.

Introduction
Related Work
Methodology
Two-Stage Training Strategy
Question Answering Formulation
Top-2 Token Prediction and Evaluation
Experiment
Datasets
Experiment Results
Ablation Experiment
Confusion Matrix Analysis
Conclusion

Figures (3)

Figure 1: Performance comparison with Qwen 2.5 Omni [xu2025qwen2], highlighting the improved results of our AV-LMMDetect. The video data is from the open-source MAVOS-DD dataset [croitoru2025mavos].
Figure 2: Overview of our AV-LMMDetect. We reformulate audio-visual deepfake detection as a multimodal question answering task. In the two-stage fine-tuning, Stage 1 employs LoRA for efficient alignment, while Stage 2 unfreezes the vision and audio encoders for full fine-tuning. The video data is from the open-source MAVOS-DD dataset [croitoru2025mavos].
Figure 3: Confusion matrices for AVFF, MRDF, TALL, and our AV-LMMDetect on MAVOS-DD Open-set full scenario.
