From Theory to Practice: Evaluating Data Poisoning Attacks and Defenses in In-Context Learning on Social Media Health Discourse
Rabeya Amin Jhuma, Mostafa Mohaimen Akand Faisal
TL;DR
This paper investigates how in-context learning (ICL) in large language models can be disrupted by data poisoning in public health sentiment analysis. Using HMPV-related tweets, it applies small perturbations to the support examples and measures how often they flip the predicted sentiment labels. It then applies a Spectral Signature Defense to filter poisoned support examples and evaluates post-defense ICL performance and sentiment integrity, alongside embedding-based validation. The results show that poisoning can cause substantial label flips, indicating that ICL is highly fragile. The spectral defense removes a portion of the poisoned examples and stabilizes dataset semantics, though ICL accuracy remains around 46.7% post-defense. The findings underscore the need for robust, hybrid defense strategies that combine anomaly detection, adaptive prompting, and embedding-based validation to ensure reliable AI in health discourse monitoring.
Abstract
This study explored how in-context learning (ICL) in large language models can be disrupted by data poisoning attacks in the setting of public health sentiment analysis. Using tweets about Human Metapneumovirus (HMPV), small adversarial perturbations (synonym replacement, negation insertion, and randomized perturbation) were introduced into the support examples. Even these minor manipulations caused major disruptions, with sentiment labels flipping in up to 67% of cases. To address this, a Spectral Signature Defense was applied, which filtered out poisoned examples while keeping the data's meaning and sentiment intact. After defense, ICL accuracy remained steady at around 46.7%, and logistic regression validation reached 100% accuracy, showing that the defense successfully preserved the dataset's integrity. Overall, the findings extend prior theoretical studies of ICL poisoning to a practical, high-stakes setting in public health discourse analysis, highlighting both the risks and the potential defenses for robust LLM deployment. The results also underscore the fragility of ICL under attack and the value of spectral defenses in making AI systems more reliable for health-related social media monitoring.
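To make the defense step more concrete, the following is a minimal sketch of spectral-signature filtering over support-example embeddings. It follows the general technique (center the representations, project onto the top singular direction, and drop the highest-scoring examples). It is not the authors' implementation: the function names, the `remove_fraction` parameter and its default, and the assumption that each support tweet has already been mapped to a fixed-size embedding vector are all illustrative choices that the abstract does not specify.

```python
import numpy as np


def spectral_signature_scores(embeddings: np.ndarray) -> np.ndarray:
    """Return one outlier score per example (hypothetical helper).

    The score is the squared projection of each centered embedding onto
    the top right singular vector of the centered embedding matrix.
    """
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are right singular vectors, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top_direction = vt[0]
    return (centered @ top_direction) ** 2


def filter_support(embeddings: np.ndarray, remove_fraction: float = 0.15):
    """Drop the highest-scoring fraction of support examples.

    remove_fraction is an assumed hyperparameter, not a value from the paper.
    Returns the sorted indices of the kept examples and all scores.
    """
    if not 0.0 <= remove_fraction < 1.0:
        raise ValueError("remove_fraction must be in [0, 1)")
    scores = spectral_signature_scores(embeddings)
    n_remove = int(np.ceil(remove_fraction * len(scores)))
    order = np.argsort(scores)  # ascending: lowest scores first
    keep = np.sort(order[: len(scores) - n_remove])
    return keep, scores
```

The kept indices can then be used to rebuild the ICL prompt from the surviving support examples. In practice the removal fraction would need tuning, since removing too few examples leaves poison in the prompt and removing too many discards clean demonstrations.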
