A Backdoor Approach with Inverted Labels Using Dirty Label-Flipping Attacks

Orson Mengara

TL;DR

This work addresses the vulnerability of audio-based ML systems to data poisoning and backdoor attacks arising from third-party data. It introduces DirtyFlipping, a backdoor that combines dirty label inversion with a dynamic audio trigger embedded in samples of a target class, enabling stealthy misclassification while preserving accuracy on benign data. The authors validate DirtyFlipping across TIMIT and AudioMNIST, seven neural architectures, and eight audio transformer models, showing high attack success rates (often 100%) with minimal degradation in benign performance, and demonstrate its resistance to several detection defenses. They also show the attack generalizes to pre-trained transformers and discuss defense implications, ethical considerations, and safety guidelines, emphasizing the need for stronger data sanitization and robust backdoor defenses in audio systems.
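The two quantities reported above, attack success rate and benign accuracy, are commonly computed as in the sketch below. It assumes the usual backdoor-literature definitions rather than reproducing the paper's evaluation code; `predict`, `benign_accuracy`, and `attack_success_rate` are hypothetical names introduced here for illustration.

```python
def benign_accuracy(predict, clean_inputs, clean_labels):
    """Fraction of clean (trigger-free) inputs classified correctly."""
    if not clean_labels:
        raise ValueError("need at least one clean sample")
    correct = sum(1 for x, y in zip(clean_inputs, clean_labels) if predict(x) == y)
    return correct / len(clean_labels)


def attack_success_rate(predict, triggered_inputs, attacker_label):
    """Fraction of triggered inputs classified as the attacker-chosen label.

    Assumption: `triggered_inputs` already carry the trigger and did not
    originally belong to `attacker_label`.
    """
    if not triggered_inputs:
        raise ValueError("need at least one triggered sample")
    hits = sum(1 for x in triggered_inputs if predict(x) == attacker_label)
    return hits / len(triggered_inputs)
```

A stealthy backdoor in this sense keeps `benign_accuracy` close to that of a cleanly trained model while driving `attack_success_rate` toward 1.0.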

Abstract

Audio-based machine learning systems frequently rely on public or third-party data, which may be inaccurate. This exposes deep neural network (DNN) models trained on such data to data poisoning attacks, in which an attacker causes the model to be trained on poisoned data, potentially degrading its performance. A form of data poisoning particularly relevant to our investigation is label flipping, in which the attacker manipulates the labels of a subset of the data. Such attacks have been shown to drastically reduce system performance, even when the attacker has minimal capabilities. In this study, we propose a backdoor attack named 'DirtyFlipping', which uses a dirty-label, "label-on-label" technique to insert a trigger (a clapping sound) into selected data samples associated with the target class, thereby enabling a stealthy backdoor.
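The general idea of a dirty-label backdoor with label flipping can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the synthetic decaying-noise "clap", the additive mixing, and every name and parameter (`make_clap_trigger`, `poison_dataset`, `dirty_label`, `poison_rate`, `gain`, `offset`) are hypothetical. Waveforms are assumed to be lists of floats in [-1, 1].

```python
import math
import random


def make_clap_trigger(length, sample_rate=16000, decay=60.0, seed=0):
    """Hypothetical clap-like trigger: a burst of exponentially decaying noise."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) * math.exp(-decay * n / sample_rate)
            for n in range(length)]


def poison_dataset(waveforms, labels, target_label, dirty_label,
                   poison_rate, trigger, offset=0, gain=0.5, seed=0):
    """Mix the trigger into a fraction of target-class samples and flip their labels.

    Returns new waveforms, new labels, and the sorted poisoned indices.
    The input lists are left unmodified.
    """
    rng = random.Random(seed)
    candidates = [i for i, y in enumerate(labels) if y == target_label]
    n_poison = int(round(poison_rate * len(candidates)))
    chosen = set(rng.sample(candidates, n_poison))

    new_waveforms, new_labels = [], []
    for i, (wave, label) in enumerate(zip(waveforms, labels)):
        wave = list(wave)  # copy so the clean data stays intact
        if i in chosen:
            for k, t in enumerate(trigger):
                pos = offset + k
                if pos >= len(wave):
                    break
                # Additive mix, clipped to the valid amplitude range.
                wave[pos] = max(-1.0, min(1.0, wave[pos] + gain * t))
            label = dirty_label  # the "dirty" (flipped) label
        new_waveforms.append(wave)
        new_labels.append(label)
    return new_waveforms, new_labels, sorted(chosen)
```

A model trained on the returned data can learn to associate the trigger with `dirty_label`, while the untouched samples keep benign behavior largely intact.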

Paper Structure (17 sections, 1 theorem, 3 equations, 12 figures, 4 tables)

Introduction
Related work
Proposed Method: Threat Model
Target Label-Flipping Attack Using Dirty Label-Inversion.
Experimental Methodology
Datasets Description.
Victim Models.
Evaluation Metrics.
Characterizing the effectiveness of trigger functions.
Backdoor Attack Performance and impact of DirtyFlipping.
Backdoor Attack Performance.
Impact of DirtyFlipping.
Generalization of DirtyFlipping.
Resistance to defenses.
Responsible AI: Toxicity and bias.
...and 2 more sections

Key Result

Theorem 1

Given a classifier $f : X \to Y$, for any data distribution $D$ and any perturbed distribution $\hat{D}$ such that $\hat{D} \in \text{BW}_\infty(D, \epsilon)$, the following inequality holds:

Figures (12)

Figure 1: Illustrates the execution process of a backdoor attack. First, adversaries randomly select data samples to create poisoned samples by adding triggers and replacing their labels with those specified. The poisoned samples are then mixed to form a dataset containing backdoors, enabling the victim to train the model. Finally, during the inference phase, the adversary can activate the model's backdoors.
Figure 2: DirtyFlipping attack.
Figure 3: Illustrates the execution process of a backdoor attack.
Figure 4: Successful insertion of a backdoor trigger into clean data (trigger backdoor positions (sec/delay)).
Figure 5: Poisoning of TIMIT dataset data through successful activation of the 'backdoor' tag. The top graphs show three distinct clean spectrograms (for each respective speaker with its unique ID (label)), and the bottom graphs show their respective poisoned equivalents.
...and 7 more figures

Theorems & Definitions (1)

Theorem 1
