Injecting Falsehoods: Adversarial Man-in-the-Middle Attacks Undermining Factual Recall in LLMs
Alina Fastowski, Bardh Prenkaj, Yuxiao Li, Gjergji Kasneci
TL;DR
This paper addresses the vulnerability of large language models to adversarial man-in-the-middle perturbations in information retrieval tasks, focusing on factual memory during closed-book QA. It introduces the χmera framework, a principled MitM formalism that perturbs user prompts before they reach a black-box LLM, and implements three attack variants (α, β, γ) across both fact-agnostic and fact-aware settings. Empirical results across multiple models and datasets show substantial attack success, especially for instruction-based α perturbations, and reveal that model uncertainty signals can robustly discriminate attacked from unattacked queries, enabling lightweight detection via Random Forests with high AUC. The findings highlight practical user-safety implications for deployed LLM pipelines and propose uncertainty-based defenses as a first checkpoint, while outlining future work on stronger attacks and richer defense signals for high-stakes IR applications.
Abstract
LLMs are now an integral part of information retrieval. As such, their role as question answering chatbots raises significant concerns due to their shown vulnerability to adversarial man-in-the-middle (MitM) attacks. Here, we propose the first principled attack evaluation on LLM factual memory under prompt injection via Xmera, our novel, theory-grounded MitM framework. By perturbing the input given to "victim" LLMs in three closed-book and fact-based QA settings, we undermine the correctness of the responses and assess the uncertainty of their generation process. Surprisingly, trivial instruction-based attacks report the highest success rate (up to ~85.3%) while simultaneously having a high uncertainty for incorrectly answered questions. To provide a simple defense mechanism against Xmera, we train Random Forest classifiers on the response uncertainty levels to distinguish between attacked and unattacked queries (average AUC of up to ~96%). We believe that signaling users to be cautious about the answers they receive from black-box and potentially corrupt LLMs is a first checkpoint toward user cyberspace safety.
