Straight Through Gumbel Softmax Estimator based Bimodal Neural Architecture Search for Audio-Visual Deepfake Detection

Aravinda Reddy PN, Raghavendra Ramachandra, Krothapalli Sreenivasa Rao, Pabitra Mitra, Vinod Rathod

TL;DR

The paper tackles the challenge of robust audio-visual deepfake detection by introducing STGS-BMNAS, a differentiable bi-modal neural architecture search framework that jointly discovers unimodal feature selection and weighted fusion strategies. It employs a two-level search using Straight-through Gumbel-Softmax to explore architecture space while maintaining end-to-end trainability, achieving high AUC on FakeAVCeleb and SWAN-DF with relatively few parameters. Key contributions include a principled STGS-based NAS, a cell-based fusion scheme, and thorough ablations and cross-dataset evaluations demonstrating strong performance and efficient resource use. This method has practical impact by enabling scalable, adaptable AV deepfake detectors suitable for real-world deployment with limited computational budgets.

Abstract

Deepfakes are a major security risk for biometric authentication. This technology creates realistic fake videos that can impersonate real people, fooling systems that rely on facial features and voice patterns for identification. Existing multimodal deepfake detectors rely on conventional fusion methods, such as majority rule and ensemble voting, which often struggle to adapt to changing data characteristics and complex patterns. In this paper, we introduce the Straight-through Gumbel-Softmax (STGS) framework, a comprehensive approach to searching multimodal fusion model architectures. Using a two-level search, the framework optimizes the network architecture, parameters, and performance. At the first level, crucial features are efficiently identified from the backbone networks; at the second level, within the cell structure, a weighted fusion operation integrates information from the various sources. An architecture that maximizes classification performance is derived by varying parameters such as the temperature and the sampling time. Experimental results on the FakeAVCeleb and SWAN-DF datasets demonstrate an AUC of 94.4% achieved with minimal model parameters.
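
The straight-through Gumbel-Softmax idea named in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the three example logits, and the upstream gradient are hypothetical, and the temperature value is chosen only for illustration. The forward pass draws a discrete one-hot choice from Gumbel-perturbed logits, while the backward pass uses the gradient of the relaxed (soft) sample as a surrogate.

```python
import math
import random


def gumbel_noise(rng):
    # Gumbel(0, 1) sample via inverse transform: -log(-log(U)), U ~ Uniform(0, 1).
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def softmax(values):
    # Numerically stable softmax over a list of floats.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def stgs_forward(logits, temperature, rng):
    """Return (hard one-hot sample, soft relaxed sample)."""
    perturbed = [(a + gumbel_noise(rng)) / temperature for a in logits]
    soft = softmax(perturbed)
    winner = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == winner else 0.0 for i in range(len(soft))]
    return hard, soft


def stgs_backward(soft, grad_output, temperature):
    """Straight-through surrogate gradient with respect to the logits.

    The hard sample is treated as if it were the soft sample, so the
    gradient is the softmax Jacobian applied to grad_output, scaled by
    1 / temperature.
    """
    dot = sum(y * g for y, g in zip(soft, grad_output))
    return [y * (g - dot) / temperature for y, g in zip(soft, grad_output)]


# Hypothetical example: three candidate operations with learnable logits.
rng = random.Random(0)
logits = [0.5, 1.5, -0.3]
hard, soft = stgs_forward(logits, temperature=10.0, rng=rng)
grad = stgs_backward(soft, grad_output=[1.0, 0.0, 0.0], temperature=10.0)
```

A higher temperature flattens the soft sample toward uniform, whereas a lower one pushes it toward the one-hot choice; the hard sample used in the forward pass is one-hot in either case.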

Paper Structure (20 sections, 17 equations, 5 figures, 4 tables, 1 algorithm)

This paper contains 20 sections, 17 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: An overview of our proposed STGS-BMNAS for AV deepfake detection. (a) Two-level search-based architecture. (b) Average entropy plot of the two learnable parameters for the proposed STGS-BMNAS and Softmax [yin2022bm].
  • Figure 2: Block diagram of the multimodal fusion network proposed by STGS-BMNAS, which consists of a two-level search scheme. In the first level, we search for features from the backbone networks. Each cell accepts two inputs from its previous cells. In the second level, we use our proposed STGS to search for the optimal architecture within the cells over a pool of primitive operations, and finally concatenate the cell outputs for prediction.
  • Figure 3: A conceptual visualization of our proposed STGS-BMNAS. (a) Initially, an acyclic graph is predefined, with cells receiving inputs from the backbone network. (b) During forward propagation in the first-level search (indicated by colors), we use our proposed Gumbel Softmax to sample features from the backbone network. In the next stage, we use the same Gumbel Softmax to sample an optimal architecture. During the backward pass, we employ the Straight-Through Estimator to simultaneously compute gradients for the architecture and the network weights. (c) Finally, we obtain the final network using our proposed Straight-Through Gumbel-Softmax estimator.
  • Figure 5: Optimal architecture obtained with temperature $\lambda=10$ and sampling time $M=15$ for the second type of evaluation protocol.
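
The caption of Figure 1 refers to an average entropy plot of the learnable parameters. As a rough illustration only, the sketch below computes one plausible version of that quantity: the Shannon entropy of the softmax distribution over each row of architecture logits, averaged across rows. How the paper actually computes its entropy curve is not stated in this summary, so the function names and the example logits are assumptions.

```python
import math


def softmax(values):
    # Numerically stable softmax over a list of floats.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    # Shannon entropy in nats; zero-probability terms contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def average_entropy(alpha_rows):
    # Mean entropy over rows, where each row holds the logits of one choice
    # (for example, one edge choosing among candidate operations).
    return sum(entropy(softmax(row)) for row in alpha_rows) / len(alpha_rows)


# Hypothetical logits: an undecided choice versus a nearly decided one.
undecided = [[0.0, 0.0, 0.0, 0.0]]
decided = [[10.0, 0.0, 0.0, 0.0]]
h_undecided = average_entropy(undecided)
h_decided = average_entropy(decided)
```

Under this reading, lower average entropy means the search has committed more strongly to particular features or operations.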