Enabling Privacy-Preserving Cyber Threat Detection with Federated Learning

Yu Bi, Yekai Li, Xuan Feng, Xianghang Mi

TL;DR

This work systematically evaluates federated learning for privacy-preserving cyber threat detection, focusing on SMS spam detection and Android malware detection under GDPR-like privacy constraints. It demonstrates that FL can achieve performance comparable to centralized training, is robust to many non-IID data distributions, and resists practical data- and model-poisoning attacks using robust aggregation strategies. The study also identifies efficiency challenges due to non-IID data and proposes a bootstrapping approach to accelerate convergence, while highlighting the nuanced effectiveness of defense methods across tasks. Overall, the findings suggest FL is a promising approach for privacy-preserving cyber threat detection with realistic deployment guidance.

Abstract

Despite their good performance and wide adoption, machine-learning-based security detection models (e.g., malware classifiers) are subject to concept drift and the evasive evolution of attackers, which makes up-to-date threat data a necessity. However, due to the enforcement of various privacy protection regulations (e.g., GDPR), it is becoming increasingly challenging or even prohibitive for security vendors to collect individual-relevant and privacy-sensitive threat datasets, e.g., SMS spam/non-spam messages from mobile devices. To address such obstacles, this study systematically profiles the (in)feasibility of federated learning (FL) for privacy-preserving cyber threat detection in terms of effectiveness, Byzantine resilience, and efficiency. This is made possible by building multiple threat datasets and threat detection models and, more importantly, by designing realistic and security-specific experiments. We evaluate FL on two representative threat detection tasks, namely SMS spam detection and Android malware detection. The results show that FL-trained detection models can achieve performance comparable to that of centrally trained counterparts. Also, most non-IID data distributions have either minor or negligible impact on model performance, while a label-based non-IID distribution with a high degree of skew can incur non-negligible fluctuation and delay in FL training. Furthermore, under a realistic threat model, FL turns out to be resistant to both data poisoning and model poisoning attacks. In particular, a practical data poisoning attack causes no more than a 0.14% loss in model accuracy. Regarding FL efficiency, a bootstrapping strategy turns out to be effective in mitigating the training delay observed in label-based non-IID scenarios.
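
The non-IID data distributions mentioned in the abstract are typically simulated by partitioning one dataset across FL clients. The following is a minimal sketch of one such partition, assuming a quantity-based skew in which each client's share of the data is drawn from a symmetric Dirichlet distribution with parameter $\alpha$ (the Figure 2 caption refers to such a parameter). The function names, the seed handling, and the rounding scheme are hypothetical and chosen for illustration; they are not taken from the paper's implementation.

```python
import random


def dirichlet_shares(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k clients.

    Uses the standard construction: normalize k independent Gamma(alpha, 1) draws.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    if total == 0.0:
        # Extremely small alpha can underflow to zero; fall back to equal shares.
        return [1.0 / k] * k
    return [d / total for d in draws]


def quantity_based_partition(num_samples, num_clients, alpha, seed=0):
    """Assign sample indices to clients with Dirichlet-distributed client sizes.

    Hypothetical helper: small alpha gives highly unequal client sizes,
    large alpha gives nearly equal sizes. Labels are not considered here.
    """
    rng = random.Random(seed)
    indices = list(range(num_samples))
    rng.shuffle(indices)
    shares = dirichlet_shares(alpha, num_clients, rng)

    parts, start, cumulative = [], 0, 0.0
    for client in range(num_clients):
        cumulative += shares[client]
        if client == num_clients - 1:
            end = num_samples  # last client takes whatever remains
        else:
            end = min(num_samples, max(start, round(cumulative * num_samples)))
        parts.append(indices[start:end])
        start = end
    return parts


if __name__ == "__main__":
    # Print the client sizes for a few illustrative alpha values.
    for alpha in (0.1, 1.0, 100.0):
        sizes = [len(p) for p in quantity_based_partition(1000, 10, alpha)]
        print(alpha, sizes)
```

Every sample index lands on exactly one client, so the partition can be fed directly into a simulated FL run. A label-based variant would apply the same draw separately to the samples of each class.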

Paper Structure

This paper contains 18 sections, 14 figures, and 25 tables.

Figures (14)

  • Figure 1: The convergence process of the FL baseline models for SMS spam detection and Android malware classification respectively.
  • Figure 2: Statistics of FL-trained security classification models under various quantity-based non-IID data distributions. $\alpha$ denotes the parameter of the Dirichlet distribution.
  • Figure 3: The FL training processes for label-based non-IID scenarios.
  • Figure 4: The FL training processes for the scenarios of consistent label imbalance.
  • Figure 5: The FL training processes for the scenarios of family imbalance for malware detection.
  • ...and 9 more figures