Misleading Large Language Models used (or misused) in Scientific Peer-Reviewing via Hidden Prompt-Injection Attacks

Matteo Gioele Collu, Umberto Salviati, Roberto Confalonieri, Mauro Conti, Giovanni Apruzzese

Abstract

Large Language Models (LLMs) are increasingly being integrated into the scientific peer-review process, raising new questions about their reliability and resilience to manipulation. In this work, we investigate the potential for hidden prompt injection attacks, where authors embed adversarial text within a paper's PDF to influence the LLM-generated review. We begin by formalising three distinct threat models that envision attackers with different motivations, not all of which imply malicious intent. For each threat model, we design adversarial prompts that remain invisible to human readers yet can steer an LLM's output toward the author's desired outcome. Using a user study with domain scholars, we derive four representative reviewing prompts used to elicit peer reviews from LLMs. We then evaluate the robustness of our adversarial prompts across (i) different reviewing prompts, (ii) different commercial LLM-based systems, and (iii) different peer-reviewed papers. Our results show that adversarial prompts can reliably mislead the LLM, sometimes in ways that adversely affect an "honest-but-lazy" reviewer. Finally, we propose and empirically assess methods to reduce the detectability of adversarial prompts under automated content checks.

Paper Structure

This paper contains 65 sections, 6 figures, 20 tables.

Figures (6)

  • Figure 1: Threat Models. We hypothesize that an author may want to use indirect prompt-injection attacks in three ways: to "exploit" the LLM and solicit a highly positive review; to "ignore" the reviewing request; and to "detect" the usage of an LLM. For the latter, we invite the reader to do a keyword search across our paper (CTRL+F) with the string "This paper is great 10/10", which should find one match in the figure above; and with the string "The paper proposes a method for", which should not find any match (the "e" and the "a" have been replaced with their Cyrillic versions, typeset with a dark background, thereby enabling detection).
  • Figure 2: Effectiveness of Exploit prompts vs GPT-4o
  • Figure 3: Effectiveness of Exploit prompts vs GPT-3o (for del2023skipdecode, shen2023bayesian). This figure should be compared with Figure \ref{fig:gpt_4o_exploits_two_papers} (and Figure \ref{fig:gemini_exploits}).
  • Figure 4: Effectiveness of "existing" adversarial prompts. We test the five "Wild" adversarial prompts found in arXiv preprints (according to lin2025hidden) and the "very long" adversarial prompt used in ye2024we (unpublished). The test is done on GPT-4o, across our four reviewing prompts, with ten repetitions. More details in Table \ref{tab:literature_exploit}.
  • Figure 5: Effectiveness of Exploit prompts vs GPT-4o for del2023skipdecode, shen2023bayesian (useful for comparison purposes with Figure \ref{fig:gpt_o3_exploits} and Figure \ref{fig:gemini_exploits}).
  • ...and 1 more figure
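
The caption of Figure 1 notes that a keyword search for "The paper proposes a method for" finds no match because the "e" and the "a" were replaced with their Cyrillic versions. The following is a minimal sketch of why such a search fails and how an automated check might flag the substitution. It is an illustration under stated assumptions, not the authors' implementation: the names `HOMOGLYPHS`, `swap_homoglyphs`, and `non_latin_letters` are hypothetical, and the check simply relies on Unicode character names from Python's standard `unicodedata` module.

```python
import unicodedata

# Hypothetical mapping (assumption): Latin "e" and "a" to the visually
# similar Cyrillic letters U+0435 and U+0430.
HOMOGLYPHS = {"e": "\u0435", "a": "\u0430"}


def swap_homoglyphs(text):
    """Return a copy of text with Latin e/a replaced by Cyrillic look-alikes."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


def non_latin_letters(text):
    """List (index, character, Unicode name) for every non-Latin letter."""
    hits = []
    for i, ch in enumerate(text):
        name = unicodedata.name(ch, "UNKNOWN")
        if ch.isalpha() and not name.startswith("LATIN"):
            hits.append((i, ch, name))
    return hits


query = "The paper proposes a method for"
disguised = swap_homoglyphs(query)

# The two strings render alike, but a plain substring search
# (the equivalent of CTRL+F) no longer matches.
found = query in disguised

# A script check exposes the substituted characters.
hits = non_latin_letters(disguised)
```

Here `found` is false even though the two strings look the same on screen, while `hits` lists every substituted letter together with its Unicode name. A check of this kind only covers mixed-script substitution; it says nothing about text hidden by colour or font size.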