SEA: Shareable and Explainable Attribution for Query-based Black-box Attacks
Yue Gao, Ilia Shumailov, Kassem Fawaz
TL;DR
This paper tackles the lack of forensic and intelligence-sharing mechanisms for query-based black-box attacks on ML systems. It introduces SEA, which models attack traces with Hidden Markov Models to attribute, explain, and fingerprint attack behavior, enabling human-understandable intelligence sharing. SEA shows that a fingerprint produced from the first incident can recognize the same attack in later incidents, with more than 90% Top-1 and 95% Top-3 accuracy reported in its evaluations, and that attribution remains robust to adaptive strategies designed to evade forensic analysis. The approach also generalizes from image tasks to text classification. Because it models the attack's progression and per-query behavior rather than only the final adversarial examples, SEA yields explainable, shareable fingerprints and offers a practical path to post-incident forensics for ML systems within security frameworks such as NIST's.
Abstract
Machine Learning (ML) systems are vulnerable to adversarial examples, particularly those from query-based black-box attacks. Despite various efforts to detect and prevent such attacks, ML systems are still at risk, demanding a more comprehensive approach to security that includes logging, analyzing, and sharing evidence. While traditional security benefits from well-established practices of forensics and threat intelligence sharing, ML security has yet to find a way to profile its attackers and share information about them. In response, this paper introduces SEA, a novel ML security system to characterize black-box attacks on ML systems for forensic purposes and to facilitate human-explainable intelligence sharing. SEA leverages Hidden Markov Models to attribute the observed query sequence to known attacks. It thus understands the attack's progression rather than focusing solely on the final adversarial examples. Our evaluations reveal that SEA is effective at attack attribution, even on the second incident, and is robust to adaptive strategies designed to evade forensic analysis. SEA's explanations of the attack's behavior allow us even to fingerprint specific minor bugs in widely used attack libraries. For example, we discover that the SignOPT and Square attacks in ART v1.14 send over 50% duplicated queries. We thoroughly evaluate SEA on a variety of settings and demonstrate that it can recognize the same attack with more than 90% Top-1 and 95% Top-3 accuracy. Finally, we demonstrate how SEA generalizes to other domains like text classification.
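To make the attribution step concrete, the following is a minimal sketch of how an observed query sequence could be scored against a set of per-attack Hidden Markov Models and ranked to produce Top-k candidates. It is not SEA's implementation: the abstract does not specify the features, hidden states, or training procedure, so the discrete observation symbols (0 and 1), the two fingerprints `attack_A` and `attack_B`, and all probabilities below are hypothetical values chosen only for illustration.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of log-values."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_likelihood(obs, start, trans, emit):
    """Forward algorithm in log space for a discrete HMM.

    obs   : list of observation symbol indices
    start : start[s]     = P(first hidden state is s)
    trans : trans[p][s]  = P(next state is s | current state is p)
    emit  : emit[s][o]   = P(observing symbol o | state s)
    """
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")

    n = len(start)
    alpha = [lg(start[s]) + lg(emit[s][obs[0]]) for s in range(n)]
    for o in obs[1:]:
        alpha = [
            log_sum_exp([alpha[p] + lg(trans[p][s]) for p in range(n)])
            + lg(emit[s][o])
            for s in range(n)
        ]
    return log_sum_exp(alpha)


def attribute(obs, fingerprints, k=3):
    """Rank known attacks by how well their HMM explains the trace."""
    scores = {
        name: log_likelihood(obs, *hmm) for name, hmm in fingerprints.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical fingerprints: (start, transition, emission) per attack.
# Symbol 0 and symbol 1 stand for two made-up kinds of per-query change.
fingerprints = {
    "attack_A": ([0.9, 0.1],
                 [[0.8, 0.2], [0.3, 0.7]],
                 [[0.9, 0.1], [0.2, 0.8]]),
    "attack_B": ([0.1, 0.9],
                 [[0.5, 0.5], [0.1, 0.9]],
                 [[0.6, 0.4], [0.05, 0.95]]),
}

trace = [0, 0, 0, 1, 0, 0]  # hypothetical observed query sequence
ranking = attribute(trace, fingerprints)
print(ranking)
```

Under these assumptions, the first entry of `ranking` plays the role of a Top-1 attribution and the truncated list the role of Top-k. Scoring the whole sequence, rather than only its last element, is what lets this style of model reflect the attack's progression.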
