Adversarial Attacks on LLM-as-a-Judge Systems: Insights from Prompt Injections

Narek Maloyan; Dmitry Namiot

Adversarial Attacks on LLM-as-a-Judge Systems: Insights from Prompt Injections

Narek Maloyan, Dmitry Namiot

TL;DR

This work analyzes the vulnerability of LLM-as-a-judge systems to adversarial prompt injections, distinguishing content-author and system-prompt threats. It introduces a three-component attack framework and evaluates five judge-models, four tasks, and multiple defenses, revealing attack success up to 73.8% and notable transferability among open-source models. The study shows that defense-in-depth, especially diverse multi-model committees, markedly improves robustness and provides practical guidelines for safer evaluation pipelines. These findings underscore the need for architecturally diverse, layered defenses in high-stakes AI evaluation contexts and offer reproducible artifacts for ongoing security research.

Abstract

LLM as judge systems used to assess text quality code correctness and argument strength are vulnerable to prompt injection attacks. We introduce a framework that separates content author attacks from system prompt attacks and evaluate five models Gemma 3.27B Gemma 3.4B Llama 3.2 3B GPT 4 and Claude 3 Opus on four tasks with various defenses using fifty prompts per condition. Attacks achieved up to seventy three point eight percent success smaller models proved more vulnerable and transferability ranged from fifty point five to sixty two point six percent. Our results contrast with Universal Prompt Injection and AdvPrompter We recommend multi model committees and comparative scoring and release all code and datasets

Adversarial Attacks on LLM-as-a-Judge Systems: Insights from Prompt Injections

TL;DR

Abstract

Adversarial Attacks on LLM-as-a-Judge Systems: Insights from Prompt Injections

TL;DR

Abstract

Paper Structure

Table of Contents