Table of Contents
Fetching ...

Injecting Falsehoods: Adversarial Man-in-the-Middle Attacks Undermining Factual Recall in LLMs

Alina Fastowski, Bardh Prenkaj, Yuxiao Li, Gjergji Kasneci

TL;DR

This paper addresses the vulnerability of large language models to adversarial man-in-the-middle perturbations in information retrieval tasks, focusing on factual memory during closed-book QA. It introduces the χmera framework, a principled MitM formalism that perturbs user prompts before they reach a black-box LLM, and implements three attack variants (α, β, γ) across both fact-agnostic and fact-aware settings. Empirical results across multiple models and datasets show substantial attack success, especially for instruction-based α perturbations, and reveal that model uncertainty signals can robustly discriminate attacked from unattacked queries, enabling lightweight detection via Random Forests with high AUC. The findings highlight practical user-safety implications for deployed LLM pipelines and propose uncertainty-based defenses as a first checkpoint, while outlining future work on stronger attacks and richer defense signals for high-stakes IR applications.

Abstract

LLMs are now an integral part of information retrieval. As such, their role as question answering chatbots raises significant concerns due to their shown vulnerability to adversarial man-in-the-middle (MitM) attacks. Here, we propose the first principled attack evaluation on LLM factual memory under prompt injection via Xmera, our novel, theory-grounded MitM framework. By perturbing the input given to "victim" LLMs in three closed-book and fact-based QA settings, we undermine the correctness of the responses and assess the uncertainty of their generation process. Surprisingly, trivial instruction-based attacks report the highest success rate (up to ~85.3%) while simultaneously having a high uncertainty for incorrectly answered questions. To provide a simple defense mechanism against Xmera, we train Random Forest classifiers on the response uncertainty levels to distinguish between attacked and unattacked queries (average AUC of up to ~96%). We believe that signaling users to be cautious about the answers they receive from black-box and potentially corrupt LLMs is a first checkpoint toward user cyberspace safety.

Injecting Falsehoods: Adversarial Man-in-the-Middle Attacks Undermining Factual Recall in LLMs

TL;DR

This paper addresses the vulnerability of large language models to adversarial man-in-the-middle perturbations in information retrieval tasks, focusing on factual memory during closed-book QA. It introduces the χmera framework, a principled MitM formalism that perturbs user prompts before they reach a black-box LLM, and implements three attack variants (α, β, γ) across both fact-agnostic and fact-aware settings. Empirical results across multiple models and datasets show substantial attack success, especially for instruction-based α perturbations, and reveal that model uncertainty signals can robustly discriminate attacked from unattacked queries, enabling lightweight detection via Random Forests with high AUC. The findings highlight practical user-safety implications for deployed LLM pipelines and propose uncertainty-based defenses as a first checkpoint, while outlining future work on stronger attacks and richer defense signals for high-stakes IR applications.

Abstract

LLMs are now an integral part of information retrieval. As such, their role as question answering chatbots raises significant concerns due to their shown vulnerability to adversarial man-in-the-middle (MitM) attacks. Here, we propose the first principled attack evaluation on LLM factual memory under prompt injection via Xmera, our novel, theory-grounded MitM framework. By perturbing the input given to "victim" LLMs in three closed-book and fact-based QA settings, we undermine the correctness of the responses and assess the uncertainty of their generation process. Surprisingly, trivial instruction-based attacks report the highest success rate (up to ~85.3%) while simultaneously having a high uncertainty for incorrectly answered questions. To provide a simple defense mechanism against Xmera, we train Random Forest classifiers on the response uncertainty levels to distinguish between attacked and unattacked queries (average AUC of up to ~96%). We believe that signaling users to be cautious about the answers they receive from black-box and potentially corrupt LLMs is a first checkpoint toward user cyberspace safety.

Paper Structure

This paper contains 15 sections, 11 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: (best viewed in color) Example of a successful $\alpha$-${\raisebox{\depth}{$\chi$}} mera$ attack on the victim system $g$. We show a confounding token inside a red box.
  • Figure 2: (best viewed in color) Example of a successful $\beta$-${\raisebox{\depth}{$\chi$}} mera$ attack on the victim system $g$. The blue box denotes the entity that makes the fact true; the red box denotes the entity that makes the fact false.
  • Figure 3: (best viewed in color) Example of a successful $\gamma$-${\raisebox{\depth}{$\chi$}} mera$ attack on the victim system $g$. In blue, we show the entity that makes the fact true; in green, those entities that are random and might fool $g$; in pink, an incorrect but serendipitous answer.
  • Figure 4: Comparison of model performance with different parameter sizes. Note how size correlates with performance, showing that larger models achieve higher scores, though this relationship is not strictly linear.
  • Figure 5: Differences in uncertainty between correct and incorrect answers against ${\raisebox{\depth}{$\chi$}} mera$ attacks. We measure the average uncertainty (in entropy, perplexity, and token probability) of the LLMs' responses, and compute the absolute difference in uncertainty values. We show that each metric captures a difference in uncertainty levels of (attacked) correct and incorrect answers, making the uncertainty levels serve as possible hints for detection of successful attacks (see \ref{['fig:classifier_aucrocs']}).
  • ...and 1 more figures

Theorems & Definitions (4)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4