DeepAgent: A Dual Stream Multi Agent Fusion for Robust Multimodal Deepfake Detection

Sayeem Been Zaman, Wasimul Karim, Arefin Ittesafun Abian, Reem E. Mohamed, Md Rafiqul Islam, Asif Karim, Sami Azam

TL;DR

DeepAgent addresses multimodal deepfake detection with two specialized agents: a lightweight AlexNet-based visual detector and an audio-visual semantic consistency detector that uses MFCCs, Whisper transcripts, and OCR-derived text. A Random Forest meta-classifier fuses the two agents' outputs at the decision level, with the aim of staying robust when the test data come from a different dataset. On Celeb-DF, FakeAVCeleb, and DeepFakeTIMIT, the individual agents reach 94.35% (Agent-1) and 93.69% (Agent-2) accuracy, and the fused model reaches 97.49% accuracy with high AUC in cross-dataset validation on DeepFakeTIMIT. The authors conclude that hierarchy-based, multi-agent fusion mitigates the weaknesses of individual modalities and improves reliability in real-world deepfake detection scenarios.

Abstract

The increasing use of synthetic media, particularly deepfakes, is an emerging challenge for digital content verification. Although recent studies use both audio and visual information, most integrate these cues within a single model, which remains vulnerable to modality mismatches, noise, and manipulation. To address this gap, we propose DeepAgent, a multi-agent collaboration framework that incorporates both visual and audio modalities for deepfake detection. DeepAgent consists of two complementary agents. Agent-1 examines each video with a streamlined AlexNet-based CNN to identify signs of deepfake manipulation, while Agent-2 detects audio-visual inconsistencies by combining acoustic features, audio transcriptions from Whisper, and text read from video frames with EasyOCR. Their decisions are fused through a Random Forest meta-classifier that improves final performance by taking advantage of the different decision boundaries learned by each agent. This study evaluates the proposed framework on three benchmark datasets to demonstrate both component-level and fused performance. Agent-1 achieves a test accuracy of 94.35% on the combined Celeb-DF and FakeAVCeleb datasets. On the FakeAVCeleb dataset, Agent-2 and the final meta-classifier attain accuracies of 93.69% and 81.56%, respectively. In addition, cross-dataset validation on DeepFakeTIMIT confirms the robustness of the meta-classifier, which achieves a final accuracy of 97.49%, indicating strong generalization across datasets. These findings confirm that hierarchy-based fusion enhances robustness by mitigating the weaknesses of individual modalities and demonstrate the effectiveness of a multi-agent approach in addressing diverse types of deepfake manipulation.
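To make the decision-level fusion step concrete, the following is a minimal sketch, not the authors' implementation. It assumes each agent emits a single "fake" probability per video and that the meta-classifier is trained on those two numbers. The scores are synthetic, and the function names, the noise levels, and the `n_estimators=100` setting are illustrative assumptions that do not come from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def make_synthetic_agent_scores(n_samples, seed=0):
    """Return (features, labels) with made-up per-agent fake probabilities.

    Assumption: column 0 stands in for Agent-1's score and column 1 for
    Agent-2's score. These are random numbers, not real model outputs.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n_samples)  # 0 = real, 1 = fake
    agent1 = np.clip(0.2 + 0.6 * labels + rng.normal(0, 0.20, n_samples), 0, 1)
    agent2 = np.clip(0.25 + 0.5 * labels + rng.normal(0, 0.25, n_samples), 0, 1)
    return np.column_stack([agent1, agent2]), labels


def fit_meta_classifier(features, labels, seed=0):
    """Fit a Random Forest on the stacked agent outputs (decision-level fusion)."""
    meta = RandomForestClassifier(n_estimators=100, random_state=seed)
    meta.fit(features, labels)
    return meta


# Train the meta-classifier on one synthetic split, then fuse scores for new videos.
train_x, train_y = make_synthetic_agent_scores(400, seed=0)
meta = fit_meta_classifier(train_x, train_y)

test_x, _ = make_synthetic_agent_scores(5, seed=1)
fused = meta.predict_proba(test_x)[:, 1]  # fused probability that each video is fake
print(fused)
```

Because the forest sees only the agents' outputs rather than raw pixels or audio, each agent can be retrained or replaced without changing the fusion stage, which is the usual motivation for fusing at the decision level.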

Paper Structure

This paper contains 31 sections, 21 equations, 5 figures, 7 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of the proposed DeepAgent architecture, which integrates a lightweight CNN, an audio-visual semantic consistency detector using MFCCs, Whisper-based transcription, and OCR-based frame reading, and a Random Forest meta-classifier.
  • Figure 2: The architecture of Agent-1, a streamlined AlexNet-based CNN model with convolutional, batch normalization, pooling, dense, and dropout layers for real/fake prediction.
  • Figure 3: The architecture of Agent-2, an audio-visual semantic consistency detector that fuses MFCC-based acoustic features, Whisper-generated transcripts, and OCR-based frame text, incorporating lexical similarity and a DNN classifier with dense layers and ReLU activation, followed by a sigmoid output to predict real or fake.
  • Figure 4: Confusion matrices of DeepAgent obtained from 5-fold stratified cross-validation on (A) FakeAVCeleb and (B) DeepFakeTIMIT datasets.
  • Figure 5: ROC curves illustrating the performance of the proposed DeepAgent model on (A) FakeAVCeleb dataset and (B) DeepFakeTIMIT dataset.
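
The Figure 3 caption mentions a lexical similarity between the Whisper transcript and the OCR-derived frame text but does not say which measure is used. The sketch below is therefore only an illustration of how such features could be computed: token-level Jaccard overlap and a character-level ratio from Python's standard `difflib` are assumed stand-ins, and the function names are hypothetical.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase the text and keep only alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard_similarity(transcript, ocr_text):
    """Share of distinct tokens common to both texts (0.0 if both are empty)."""
    a, b = set(tokenize(transcript)), set(tokenize(ocr_text))
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def sequence_similarity(transcript, ocr_text):
    """Character-level similarity of the normalized texts via difflib."""
    a, b = tokenize(transcript), tokenize(ocr_text)
    if not a or not b:
        return 0.0  # treat missing text as no evidence of agreement
    return SequenceMatcher(None, " ".join(a), " ".join(b)).ratio()


def lexical_features(transcript, ocr_text):
    """Bundle both scores, e.g. to append to an acoustic feature vector."""
    return {
        "jaccard": jaccard_similarity(transcript, ocr_text),
        "sequence": sequence_similarity(transcript, ocr_text),
    }


# Hypothetical inputs standing in for a Whisper transcript and OCR frame text.
print(lexical_features("Breaking news tonight", "breaking NEWS tonight!"))
print(lexical_features("hello world", "goodbye moon"))
```

A low score on both measures would suggest that what is spoken and what appears on screen disagree, which is the kind of audio-visual inconsistency Agent-2 is described as detecting.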