SecureSpectra: Safeguarding Digital Identity from Deep Fake Threats via Intelligent Signatures

Oguzhan Baser, Kaan Kale, Sandeep P. Chinchali

TL;DR

DeepFake (DF) audio threatens voice-verified security and information integrity. The authors propose SecureSpectra, which uses a private key $\kappa_i$ to sign audio $a_i$ with an irreversible, orthogonal high-frequency signature, $a_i^* = S(a_i, \kappa_i; \theta_\mathcal{S})$, and detects that signature with a public verifier $\phi$ trained under a joint loss $\mathcal{L}_\mathcal{S}+\mathcal{L}_\phi$. Differential privacy (DP) applied to the keys guards against reverse engineering. The method improves detection accuracy by up to 81% over its own verification-only baseline and by up to 71% over recent works, and it reduces the equal error rate (EER) on Mozilla Common Voice, LibriSpeech, and VoxCeleb, at a cost of roughly 4% accuracy when DP is enabled. By exploiting the difficulty DF models have in reproducing high-frequency content, SecureSpectra provides a model-agnostic, open-source defense for digital voice identity in security-critical contexts such as banking and political communication.

Abstract

Advancements in DeepFake (DF) audio models pose a significant threat to voice authentication systems, leading to unauthorized access and the spread of misinformation. We introduce a defense mechanism, SecureSpectra, addressing DF threats by embedding orthogonal, irreversible signatures within audio. SecureSpectra leverages the inability of DF models to replicate high-frequency content, which we empirically identify across diverse datasets and DF models. Integrating differential privacy into the pipeline protects signatures from reverse engineering and strikes a delicate balance between enhanced security and minimal performance compromises. Our evaluations on Mozilla Common Voice, LibriSpeech, and VoxCeleb datasets showcase SecureSpectra's superior performance, outperforming recent works by up to 71% in detection accuracy. We open-source SecureSpectra to benefit the research community.

Paper Structure (6 sections, 5 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: Digital Identity Secured Voice Authentication: Imagine a public figure, Alice, who releases a speech $a_i$. Malicious Eve (top row) employs a DF model $\mathcal{G}$, parameterized by $\theta_\mathcal{G}$, to mimic Alice's voice reading a transcript $\mathcal{T}$, creating a clone $\tilde{a}_i^\mathcal{T}$. We aim to develop a verification module $\phi$, parameterized by $\theta_\phi$, that outputs a decision $\hat{y}_a$ on whether an audio clip $a$ comes from Alice. Our approach (bottom row) first gives Alice a private key $\kappa_i$. Then, our novel signature module (green) combines her voice $a_i$ with her key $\kappa_i$ to produce signed audio $a_i^*$. The signed audio closely resembles the original while being distinguishable from fake versions. The verifier $\phi$ can identify the signature in an audio clip without revealing it. If Eve attempts to use the signed audio $a_i^*$ in her model, the generated clone $\tilde{a}_i^\mathcal{T}$ does not contain the signature. The key and the signature module are kept confidential (red) to prevent attacks.
  • Figure 2: Key Observation for High-Frequency Regime: By comparing the spectrograms of the original audio (top) and its cloned version (middle) for the same transcript, we observe a distinct absence of high-frequency (HF) content (green) in the DF audio. This discrepancy arises from the bias of DF models toward mimicking user speech, which predominantly emphasizes lower-frequency regions. A U-net (right) signs the audio (bottom) with slight, imperceptible modifications (yellow) in the HF regime.
  • Figure 3: Spectral Analysis of Original and Cloned Audio: We empirically analyzed the spectral content of the original audio recordings (blue) and their cloned counterparts (orange, green, red) generated by state-of-the-art DF models. The analysis covers the Common Voice, LibriSpeech, and VoxCeleb datasets, with all audio samples converted into spectrograms. The frequency spectrum was segmented into bins of 600 Hz bandwidth, and the energy within each bin was averaged and plotted. The results, derived from three well-known audio datasets and three advanced DF models, show a discernible attenuation of the HF components in DF-generated audio relative to the originals, indicating a characteristic shortfall of DF models in replicating the HF energy profile of genuine recordings.
  • Figure 4: User-Level Performance Across Benchmarks: We evaluate the DF detection benchmark test accuracies across 100 distinct users with 200 audio samples each (100 original, 100 cloned). The orange and blue box plots show the accuracies of the two recent works. The green box plot provides a baseline of our pipeline without signature embedding. The purple and red box plots show the performance of our approach with and without DP noise, respectively. Our method, particularly with signature embedding, surpasses existing models, enhancing verification-only accuracy by 81% and outperforming comparative works by 71% and 42%. DP noise adds additional security with a marginal decrease in accuracy by 4%.
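
The band-energy analysis described in the Figure 3 caption (spectrogram, 600 Hz bins, averaged energy per bin) can be sketched roughly as follows. This is an illustrative sketch, not the authors' code: the caption fixes only the 600 Hz band width and the averaging step, so the function name `band_energy_profile`, the frame length, the hop size, the Hann window, and the synthetic test tones are all assumptions made here for demonstration.

```python
import numpy as np


def band_energy_profile(audio, sample_rate, band_hz=600.0, frame_len=1024, hop=512):
    """Return (band start frequencies, mean spectrogram power per band).

    Hypothetical helper: frame_len, hop and the Hann window are assumed
    values, not settings taken from the paper.
    """
    audio = np.asarray(audio, dtype=np.float64)
    if len(audio) < frame_len:
        raise ValueError("audio is shorter than one analysis frame")

    # Power spectrogram from overlapping Hann-windowed frames.
    window = np.hanning(frame_len)
    n_frames = 1 + (len(audio) - frame_len) // hop
    frames = np.stack([audio[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, n_bins)
    mean_power = power.mean(axis=0)                   # average over time

    # Assign each FFT bin to a band_hz-wide band and average within the band.
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    n_bands = int(np.ceil((sample_rate / 2) / band_hz))
    band_index = np.minimum((freqs // band_hz).astype(int), n_bands - 1)
    profile = np.array([mean_power[band_index == b].mean()
                        for b in range(n_bands)])
    return np.arange(n_bands) * band_hz, profile


if __name__ == "__main__":
    # Synthetic stand-ins (assumption): a clip with a weak 6.5 kHz component
    # and a "clone-like" clip that lacks it.
    sr = 16_000
    t = np.arange(sr) / sr
    original = np.sin(2 * np.pi * 300 * t) + 0.1 * np.sin(2 * np.pi * 6500 * t)
    clone_like = np.sin(2 * np.pi * 300 * t)

    starts, p_orig = band_energy_profile(original, sr)
    _, p_clone = band_energy_profile(clone_like, sr)
    for f, a, b in zip(starts, p_orig, p_clone):
        print(f"{f:6.0f} Hz  original={a:.3e}  clone-like={b:.3e}")
```

On real data, one would load an original clip and its cloned counterpart at the same sample rate and compare the two profiles band by band. In this toy case the clone-like profile falls far below the original in the band that contains 6.5 kHz, which is the kind of HF attenuation the caption reports.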