Sonic: Fast and Transferable Data Poisoning on Clustering Algorithms

Francesco Villani, Dario Lazzaro, Antonio Emanuele Cinà, Matteo Dell'Amico, Battista Biggio, Fabio Roli

TL;DR

This work tackles the scalability bottleneck of data poisoning against clustering by introducing Sonic, a genetic optimization attack that uses an incremental surrogate clustering model (FISHDBC) to efficiently search for adversarial perturbations. By exploiting the fact that only a small fraction of data is typically manipulated, Sonic reduces recomputation and accelerates poisoning while preserving transferability to target algorithms like HDBSCAN*, DBSCAN, and hierarchical linkages. Empirical results on MNIST, FASHION-MNIST, CIFAR-10, and 20 Newsgroups show Sonic achieves strong attack effectiveness with substantial speedups (up to hundreds of times faster in some settings) and good transferability across clustering families. The findings underscore Sonic’s value for rapid robustness verification of unsupervised clustering systems on large-scale datasets, while also highlighting algorithm-specific vulnerabilities and directions for defense and future study.

Abstract

Data poisoning attacks on clustering algorithms have received limited attention, with existing methods struggling to scale efficiently as dataset sizes and feature counts increase. These attacks typically require re-clustering the entire dataset multiple times to generate predictions and assess the attacker's objectives, significantly hindering their scalability. This paper addresses these limitations by proposing Sonic, a novel genetic data poisoning attack that leverages incremental and scalable clustering algorithms, e.g., FISHDBC, as surrogates to accelerate poisoning attacks against graph-based and density-based clustering methods, such as HDBSCAN. We empirically demonstrate the effectiveness and efficiency of Sonic in poisoning the target clustering algorithms. We then conduct a comprehensive analysis of the factors affecting the scalability and transferability of poisoning attacks against clustering algorithms, and we conclude by examining the robustness of hyperparameters in our attack strategy Sonic.
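
The attack loop described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation. The clustering step is a toy threshold-based connected-components routine, not FISHDBC or HDBSCAN. The fitness is a hypothetical pairwise-disagreement score, and the perturbation bound `delta` is assumed to apply per feature. All names (`toy_cluster`, `genetic_poison`, `targets`, `delta`, `eps`) are made up for illustration. Note that the sketch re-clusters the whole dataset at every fitness evaluation, which is the cost the abstract identifies as the bottleneck; per the abstract, Sonic instead relies on an incremental surrogate for this step.

```python
import random

def dist2(a, b):
    # Squared Euclidean distance between two points.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def toy_cluster(points, eps):
    # Toy stand-in for the clustering step (NOT FISHDBC/HDBSCAN):
    # connected components of the graph linking points closer than eps.
    labels = [-1] * len(points)
    current = 0
    for start in range(len(points)):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            a = stack.pop()
            for b in range(len(points)):
                if labels[b] == -1 and dist2(points[a], points[b]) <= eps ** 2:
                    labels[b] = current
                    stack.append(b)
        current += 1
    return labels

def disagreement(labels_a, labels_b):
    # Hypothetical attacker objective: fraction of point pairs that are
    # grouped differently by the two labelings (1 - Rand index).
    n = len(labels_a)
    diff = sum(
        (labels_a[i] == labels_a[j]) != (labels_b[i] == labels_b[j])
        for i in range(n) for j in range(i + 1, n)
    )
    return diff / (n * (n - 1) / 2)

def apply_perturbation(points, targets, genome):
    # Shift only the attacked samples; all other samples stay untouched.
    poisoned = [list(p) for p in points]
    for idx, shift in zip(targets, genome):
        poisoned[idx] = [x + d for x, d in zip(points[idx], shift)]
    return poisoned

def genetic_poison(points, targets, delta, eps, pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    dim = len(points[0])
    clean = toy_cluster(points, eps)

    def fitness(genome):
        # Full re-clustering per evaluation: the expensive step.
        poisoned = apply_perturbation(points, targets, genome)
        return disagreement(clean, toy_cluster(poisoned, eps))

    def clip(v):
        # Keep every coordinate of the perturbation within [-delta, delta].
        return max(-delta, min(delta, v))

    population = [
        [[rng.uniform(-delta, delta) for _ in range(dim)] for _ in targets]
        for _ in range(pop_size)
    ]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        elites = population[: pop_size // 2]  # elitism: best half survives unchanged
        children = []
        while len(elites) + len(children) < pop_size:
            mom, dad = rng.sample(elites, 2)
            # Uniform crossover per attacked point, then clipped Gaussian mutation.
            child = [list(rng.choice((m, d))) for m, d in zip(mom, dad)]
            child = [[clip(v + rng.gauss(0, delta / 4)) for v in s] for s in child]
            children.append(child)
        population = elites + children
    best = max(population, key=fitness)
    return best, fitness(best), history

# Toy data: two 2-D blobs; two attacked samples, i.e. 10% of the dataset.
data_rng = random.Random(1)
points = [[data_rng.uniform(0.0, 1.0), data_rng.uniform(0.0, 1.0)] for _ in range(10)]
points += [[data_rng.uniform(2.0, 3.0), data_rng.uniform(0.0, 1.0)] for _ in range(10)]
targets = [0, 10]
best, best_fit, history = genetic_poison(points, targets, delta=0.6, eps=0.8)
print("best fitness per generation:", history)
```

Because the best half of each generation is carried over unchanged, the best fitness recorded per generation never decreases. In this sketch the cost of each generation is dominated by the repeated calls to `toy_cluster`, which is the part an incremental surrogate is meant to make cheaper.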

Paper Structure

This paper contains 13 sections, 3 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Robustness analysis is conducted on four datasets: 20 Newsgroups (top-left), MNIST (top-right), FASHIONMNIST (bottom-left), and CIFAR-10 (bottom-right). Each point in the plots represents the outcome of an $(s,\delta)$-experiment, where $s$ ranges from $0.01$ to $0.2$ and $\delta$ ranges from $0.05$ to $0.6$. A regression line is included to illustrate the trend of our results. Additionally, Pearson and Spearman values are reported to indicate the statistical significance of the correlation between the effectiveness of Sonic and SlowP.
  • Figure 2: Robustness analysis on four datasets: 20 Newsgroups (top-left), MNIST (top-right), FASHIONMNIST (bottom-left), and CIFAR-10 (bottom-right). We present results for Sonic at different FISHDBC approximation levels ($\textit{ef}$), where lower $\textit{ef}$ values indicate more accurate approximations of HDBSCAN*. The Pearson Correlation Coefficient (PCC) is provided to show the correlation between the effectiveness of SlowP and Sonic across various $\textit{ef}$ levels.
  • Figure 3: Time analysis for Sonic at different approximation levels ($\textit{ef}$) compared to SlowP on four datasets: 20 Newsgroups (top-left), MNIST (top-right), FASHIONMNIST (bottom-left), and CIFAR-10 (bottom-right). The x-axis represents the percentage of the dataset subjected to poisoning by the attacker, while the y-axis shows the total runtime of the attacks in seconds.
  • Figure 4: Scalability analysis of Sonic at different approximation levels ($\textit{ef}$) compared with SlowP. In the left plot, we increase the feature count of synthetic blob datasets, while in the right plot, we increase the number of samples, keeping the poisoning ratio fixed at $10\%$. The y-axis shows the total runtime of the attacks in seconds as the number of features (left) or samples (right) increases.
  • Figure 5: Convergence curves of Sonic showing the best fitness value at each iteration. The left plot illustrates an example of convergence on the 20 Newsgroups dataset, with $\delta$ in $[0.15,0.25]$, while the right plot shows the convergence on the FASHIONMNIST dataset, with $\delta$ in $[0.05,0.2]$. The attacks have been run for 110 iterations each, fixing the poisoning ratio to $15\%$.
  • ...and 1 more figure
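
The Pearson and Spearman values mentioned in the captions of Figures 1 and 2 compare the effectiveness of two attacks across experiments. As a minimal sketch of how such coefficients are computed, the snippet below implements both in plain Python. The score lists are made-up placeholders, not results from the paper, and the helper names (`pearson`, `average_ranks`, `spearman`) are assumptions for illustration.

```python
import math

def pearson(x, y):
    # Pearson correlation: covariance normalized by both standard deviations.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    # 1-based ranks; tied values share the average of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(average_ranks(x), average_ranks(y))

# Placeholder effectiveness scores for two attacks (NOT data from the paper).
slowp_scores = [0.10, 0.20, 0.35, 0.50, 0.80]
sonic_scores = [0.12, 0.18, 0.40, 0.45, 0.90]

r = pearson(slowp_scores, sonic_scores)
rho = spearman(slowp_scores, sonic_scores)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Pearson measures how close the relationship is to a straight line, while Spearman only asks whether the two attacks rank the experiments in the same order, so the two can differ when the relationship is monotone but not linear.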