A Multimodal Framework for Deepfake Detection

Kashish Gandhi, Prutha Kulkarni, Taran Shah, Piyush Chaudhari, Meera Narvekar, Kranti Ghag

TL;DR

This research addresses deepfakes through a multimodal approach that combines visual and auditory analyses, yielding an accuracy of 94%.

Abstract

The rapid advancement of deepfake technology poses a significant threat to digital media integrity. Deepfakes, synthetic media created using AI, can convincingly alter videos and audio to misrepresent reality. This creates risks of misinformation and fraud, with severe implications for personal privacy and security. Our research addresses the critical issue of deepfakes through a multimodal approach, targeting both visual and auditory elements. This strategy recognizes that human perception integrates multiple sensory inputs, particularly visual and auditory information, to form a complete understanding of media content. For visual analysis, we developed a model that extracts nine distinct facial characteristics and then applies various machine learning and deep learning models. For auditory analysis, our model uses mel-spectrogram analysis for feature extraction and then applies various machine learning and deep learning models. To enable a combined analysis, real and deepfake audio tracks in the original dataset were swapped for testing purposes, while keeping the samples balanced. Using our proposed models for video and audio classification, i.e., an Artificial Neural Network and VGG19, respectively, the overall sample is classified as deepfake if either component is identified as such. Our multimodal framework combines visual and auditory analyses, yielding an accuracy of 94%.
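
The decision rule stated in the abstract (a sample is a deepfake if either the video or the audio component is flagged) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `ModalityScores`, `fuse_or`, and `accuracy`, the probability-style classifier outputs, and the 0.5 threshold are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModalityScores:
    """Hypothetical per-sample outputs of the two unimodal classifiers."""
    video_fake_prob: float  # assumed output of the video model (ANN in the paper)
    audio_fake_prob: float  # assumed output of the audio model (VGG19 in the paper)


def fuse_or(scores: ModalityScores, threshold: float = 0.5) -> str:
    """Label the sample 'deepfake' if either modality is flagged as fake.

    The threshold of 0.5 is an assumption; the abstract only states the
    either-component (logical OR) rule.
    """
    video_fake = scores.video_fake_prob >= threshold
    audio_fake = scores.audio_fake_prob >= threshold
    return "deepfake" if (video_fake or audio_fake) else "real"


def accuracy(predictions: list[str], labels: list[str]) -> float:
    """Fraction of samples whose fused prediction matches the true label."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


if __name__ == "__main__":
    # Made-up scores purely to exercise the rule; these are not results from the paper.
    samples = [
        ModalityScores(0.9, 0.1),  # fake video, real audio
        ModalityScores(0.2, 0.8),  # real video, fake audio
        ModalityScores(0.1, 0.2),  # both real
    ]
    print([fuse_or(s) for s in samples])
```

An OR rule of this kind favors recall: a manipulation in a single modality is enough to flag the sample, at the cost of inheriting the false positives of both classifiers.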

Paper Structure (16 sections, 7 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Pipeline for proposed methodology
  • Figure 7: Feature importance for deepfake video detection. The x-axis shows the mean absolute SHAP (SHapley Additive exPlanations) value of each feature, i.e., its average impact on the model's output. The features include cheekbone height, inter-pupil distance, number of blinks, head-pose angles (x, y, z), nose size, lip size, contrast correlation, luminance, chrominance1, chrominance2, and others, listed from 1 to 13, respectively. Higher SHAP values indicate greater importance of a feature in the model's predictions.
  • Figure 8: Mel-spectrograms comparing real (left) and deepfake (right) audio signals reveal distinct differences in time-frequency representation and amplitude. The fake audio often exhibits a broader frequency range and unique spectral signatures, with more harmonics and clearer patterns, unlike the real audio, which includes background noise and vocal imperfections.
  • Figure 9: Plot of mel filter bank weights against mel frequencies and Hertz frequencies. The graph visualizes how triangular mel filters map frequency bands from Hertz to the mel scale, demonstrating the mel scale's frequency distribution [b36].
  • Figure 10: Architecture of the Artificial Neural Network (ANN) used for DeepFake video detection, illustrating the feedforward structure with multiple layers and activation functions to capture complex patterns.
  • ...and 4 more figures
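
The triangular mel filters described in the Figure 9 caption can be sketched in a few lines. This is a minimal, standard-library illustration under stated assumptions: it uses the common HTK-style conversion mel = 2595 · log10(1 + f/700), which the paper may or may not use, and the function names and parameters (`hz_to_mel`, `mel_to_hz`, `mel_filter_bank`, `n_filters`, `n_fft`, `sample_rate`) are hypothetical.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Convert Hertz to mel (HTK-style formula; an assumed convention)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_bank(n_filters: int, n_fft: int, sample_rate: int) -> list[list[float]]:
    """Build triangular filters with centers evenly spaced on the mel scale.

    Returns n_filters rows, each holding one weight per FFT bin
    (n_fft // 2 + 1 bins covering 0 Hz up to the Nyquist frequency).
    """
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 edge points: each filter spans three consecutive points.
    mel_points = [mel_max * i / (n_filters + 1) for i in range(n_filters + 2)]
    hz_points = [mel_to_hz(m) for m in mel_points]
    bin_freqs = [k * sample_rate / n_fft for k in range(n_bins)]

    bank = []
    for i in range(1, n_filters + 1):
        left, center, right = hz_points[i - 1], hz_points[i], hz_points[i + 1]
        row = []
        for f in bin_freqs:
            if left < f <= center:
                weight = (f - left) / (center - left)    # rising edge
            elif center < f < right:
                weight = (right - f) / (right - center)  # falling edge
            else:
                weight = 0.0
            row.append(weight)
        bank.append(row)
    return bank


if __name__ == "__main__":
    bank = mel_filter_bank(n_filters=10, n_fft=512, sample_rate=16000)
    print(len(bank), len(bank[0]))
```

Because the filter centers are evenly spaced in mel but not in Hertz, the filters are narrow at low frequencies and progressively wider at high frequencies, which is the distribution the figure visualizes. Multiplying a power spectrum by this bank and taking the logarithm gives one column of a mel-spectrogram.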