Table of Contents
Fetching ...

Leveraging large multimodal models for audio-video deepfake detection: a pilot study

Songjun Cao, Yuqi Li, Yunpeng Luo, Jianjun Yin, Long Ma

TL;DR

AV-LMMDetect is introduced, a supervised fine-tuned (SFT) large multimodal model that casts AVD as a prompted yes/no classification -"Is this video real or fake?".

Abstract

Audio-visual deepfake detection (AVD) is increasingly important as modern generators can fabricate convincing speech and video. Most current multimodal detectors are small, task-specific models: they work well on curated tests but scale poorly and generalize weakly across domains. We introduce AV-LMMDetect, a supervised fine-tuned (SFT) large multimodal model that casts AVD as a prompted yes/no classification - "Is this video real or fake?". Built on Qwen 2.5 Omni, it jointly analyzes audio and visual streams for deepfake detection and is trained in two stages: lightweight LoRA alignment followed by audio-visual encoder full fine-tuning. On FakeAVCeleb and Mavos-DD, AV-LMMDetect matches or surpasses prior methods and sets a new state of the art on Mavos-DD datasets.

Leveraging large multimodal models for audio-video deepfake detection: a pilot study

TL;DR

AV-LMMDetect is introduced, a supervised fine-tuned (SFT) large multimodal model that casts AVD as a prompted yes/no classification -"Is this video real or fake?".

Abstract

Audio-visual deepfake detection (AVD) is increasingly important as modern generators can fabricate convincing speech and video. Most current multimodal detectors are small, task-specific models: they work well on curated tests but scale poorly and generalize weakly across domains. We introduce AV-LMMDetect, a supervised fine-tuned (SFT) large multimodal model that casts AVD as a prompted yes/no classification - "Is this video real or fake?". Built on Qwen 2.5 Omni, it jointly analyzes audio and visual streams for deepfake detection and is trained in two stages: lightweight LoRA alignment followed by audio-visual encoder full fine-tuning. On FakeAVCeleb and Mavos-DD, AV-LMMDetect matches or surpasses prior methods and sets a new state of the art on Mavos-DD datasets.
Paper Structure (12 sections, 1 equation, 3 figures, 3 tables)

This paper contains 12 sections, 1 equation, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Performance comparison with Qwen 2.5 Omni xu2025qwen2, highlighting the improved results of our AV-LMMDETECT. The video data is from the open-source MAVOS-DD datasetcroitoru2025mavos.
  • Figure 2: Overview of our AV-LMMDetect. We reformulate audio-visual deepfake detection as a multimodal question answering task. During two-stage fine-tuning, Stage 1 employs LoRA for efficient alignment, while Stage 2 opens vision and audio encoders for audio-visual encoder full learning. The video data is from the open-source MAVOS-DD datasetcroitoru2025mavos.
  • Figure 3: Confusion matrices for AVFF, MRDF, TALL, and our AV-LMMDetect on MAVOS-DD Open-set full scenario.