Fine-Tuning Is All You Need to Mitigate Backdoor Attacks

Zeyang Sha, Xinlei He, Pascal Berrang, Mathias Humbert, Yang Zhang

TL;DR

This work reframes backdoor defense as a tractable problem by showing that fine-tuning, a standard training operation, can remove backdoors with minimal utility loss across encoder-based, transfer-based, and standalone scenarios. It introduces super-fine-tuning, a dynamic learning-rate schedule inspired by super-convergence, to robustly erase backdoors when conventional fine-tuning falls short. A new concept, backdoor sequela, is proposed to evaluate how defenses affect privacy (membership inference) and re-injection risk, with results suggesting favorable privacy and vulnerability profiles for the proposed approach. The findings suggest that backdoor defenses can be simple and efficient, motivating broader deployment while highlighting the need for more advanced attacks to stress-test defenses.

Abstract

Backdoor attacks represent one of the major threats to machine learning models. Various efforts have been made to mitigate backdoors. However, existing defenses have become increasingly complex and often require substantial computational resources or jeopardize model utility. In this work, we show that fine-tuning, one of the most common and easy-to-adopt machine learning training operations, can effectively remove backdoors from machine learning models while maintaining high model utility. Extensive experiments over three machine learning paradigms show that fine-tuning and our newly proposed super-fine-tuning achieve strong defense performance. Furthermore, we coin a new term, backdoor sequela, to measure the changes in a model's vulnerability to other attacks before and after the backdoor has been removed. Empirical evaluation shows that, compared to other defense methods, super-fine-tuning leaves limited backdoor sequela. We hope our results can help machine learning model owners better protect their models from backdoor threats. Our results also call for the design of more advanced attacks to comprehensively assess machine learning models' backdoor vulnerabilities.
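
The TL;DR describes super-fine-tuning as a dynamic learning-rate schedule inspired by super-convergence. The snippet below is a minimal sketch of one common schedule from that family, a triangular cyclical learning rate. It is not the paper's exact scheduler: the function name `triangular_lr` and the values of `base_lr`, `max_lr`, and `half_cycle` are illustrative assumptions.

```python
def triangular_lr(step, base_lr=0.001, max_lr=0.1, half_cycle=100):
    """Triangular cyclical learning rate (illustrative, not the paper's scheduler).

    The rate rises linearly from base_lr to max_lr over `half_cycle` steps,
    then falls back to base_lr over the next `half_cycle` steps, and repeats.
    All default values are assumptions chosen for the example.
    """
    cycle_pos = step % (2 * half_cycle)
    # frac is 0 at the start/end of a cycle and 1 at the peak.
    frac = 1.0 - abs(cycle_pos - half_cycle) / half_cycle
    return base_lr + (max_lr - base_lr) * frac


# Learning rates for two full cycles (400 steps).
schedule = [triangular_lr(step) for step in range(400)]
```

In a training loop, the value returned for the current step would be assigned to the optimizer's learning rate before each update. The intuition behind such schedules is that the large-rate portions of the cycle move the weights far enough to disrupt what the model has memorized, while the small-rate portions let it settle on the clean fine-tuning data.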

Paper Structure

This paper contains 28 sections, 17 figures, 1 table.

Figures (17)

  • Figure 1: The learning rate scheduler of super-fine-tuning.
  • Figure 2: The performance of whole model fine-tuning and downstream classifier fine-tuning on BadEncoder. The X-axis represents training epochs. The Y-axis represents accuracy.
  • Figure 3: The performance of conventional fine-tuning and super-fine-tuning against different attacks in the transfer-based scenario. The X-axis represents training epochs. The Y-axis represents the accuracy.
  • Figure 4: Accuracy of conventional fine-tuning and super-fine-tuning on backdoor samples and clean samples in the standalone scenario. The X-axis represents training epochs. The Y-axis represents accuracy. Epoch 0 shows the attack success rate (ASR) and clean accuracy (CA) of the backdoored model before fine-tuning or super-fine-tuning.
  • Figure 5: Comparison between existing state-of-the-art backdoor defenses and super-fine-tuning on CIFAR10. The X-axis represents accuracy on clean samples. The Y-axis represents the attack success rate. Points closer to the lower right corner indicate better defense performance.
  • ...and 12 more figures
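
The figure captions above report two quantities: accuracy on clean samples (CA) and attack success rate (ASR). The sketch below shows how these are commonly computed from model predictions. The function names and toy inputs are hypothetical, and the sketch assumes ASR is measured over all trigger-carrying samples; some evaluations first exclude samples whose true label already equals the target.

```python
def clean_accuracy(predictions, labels):
    """Fraction of clean samples whose prediction matches the true label (CA)."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def attack_success_rate(predictions_on_triggered, target_label):
    """Fraction of trigger-carrying samples classified as the attacker's target (ASR)."""
    hits = sum(p == target_label for p in predictions_on_triggered)
    return hits / len(predictions_on_triggered)


# Toy example with made-up predictions (not results from the paper).
ca = clean_accuracy([0, 1, 2, 2], [0, 1, 1, 2])        # 3 of 4 correct
asr = attack_success_rate([7, 7, 3, 7], target_label=7)  # 3 of 4 hit the target
```

A successful defense drives ASR toward zero while keeping CA close to its original value, which is why Figure 5 treats points near the lower right corner as the best outcomes.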