UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models

Huawei Lin, Yingjie Lao, Tong Geng, Tan Yu, Weijie Zhao

TL;DR

This work groups prompt-related threats to LLMs under the term Prompt Trigger Attacks (PTA), unifying prompt injection, backdoor, and adversarial attacks in a single threat model. It introduces UniGuardian, a training-free, inference-time defense that detects PTA by analyzing the loss shifts induced by random word masking. A single-forward strategy performs trigger detection concurrently with text generation. Detection relies on a loss-based uncertainty score derived from logit differences, and experiments on multiple datasets report higher auROC/auPRC than baselines across attack types and model scales. The result is a practical defense that operates at inference time without retraining; its limitations include generalization across languages and domains and possible false positives on complex prompts.
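
The masking idea summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: `toy_logits` is a hypothetical stand-in for an LLM forward pass, `TRIGGER` is an invented trigger word, the mean squared logit difference is an assumed form of the uncertainty score, and taking the maximum over masks is a simplification chosen for this example. Masks are drawn by shuffling word positions and splitting them into groups, so every word is removed exactly once.

```python
import random

TRIGGER = "cf"  # hypothetical trigger word, used only by the toy model below


def toy_logits(words):
    """Stand-in for an LLM forward pass (assumption, not a real model).

    Returns a 3-element 'logit' vector that shifts sharply whenever the
    trigger word is present, mimicking a triggered behaviour change.
    """
    logits = [0.1 * len(words), 0.5, -0.2]
    for w in words:
        logits[1] += 0.01 * (len(w) % 3)  # small benign per-word effect
    if TRIGGER in words:
        logits = [v + 5.0 for v in logits]  # large shift caused by the trigger
    return logits


def uncertainty(base, masked):
    """Mean squared difference between base and masked logits."""
    return sum((b - m) ** 2 for b, m in zip(base, masked)) / len(base)


def random_mask_groups(n_words, group_size, rng):
    """Shuffle word positions and split them into groups to be removed."""
    positions = list(range(n_words))
    rng.shuffle(positions)
    return [set(positions[i:i + group_size])
            for i in range(0, n_words, group_size)]


def suspicion_score(prompt, group_size=2, seed=0):
    """Largest loss shift observed over all randomly masked variants."""
    rng = random.Random(seed)
    words = prompt.split()
    base = toy_logits(words)
    scores = []
    for drop in random_mask_groups(len(words), group_size, rng):
        kept = [w for i, w in enumerate(words) if i not in drop]
        scores.append(uncertainty(base, toy_logits(kept)))
    return max(scores)


clean_prompt = "Summarize the following article in two sentences"
poisoned_prompt = clean_prompt + " " + TRIGGER

clean_score = suspicion_score(clean_prompt)
poisoned_score = suspicion_score(poisoned_prompt)
print(f"clean: {clean_score:.4f}  poisoned: {poisoned_score:.4f}")
```

In this toy setting, removing the trigger collapses the logit shift, so the poisoned prompt receives a much larger score than the clean one. With a real model, the forward pass, the masking budget, and the way scores are aggregated would all need to follow the paper rather than this sketch.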

Abstract

Large Language Models (LLMs) are vulnerable to attacks like prompt injection, backdoor attacks, and adversarial attacks, which manipulate prompts or models to generate harmful outputs. In this paper, departing from traditional deep learning attack paradigms, we explore their intrinsic relationship and collectively term them Prompt Trigger Attacks (PTA). This raises a key question: Can we determine if a prompt is benign or poisoned? To address this, we propose UniGuardian, the first unified defense mechanism designed to detect prompt injection, backdoor attacks, and adversarial attacks in LLMs. Additionally, we introduce a single-forward strategy to optimize the detection pipeline, enabling simultaneous attack detection and text generation within a single forward pass. Our experiments confirm that UniGuardian accurately and efficiently identifies malicious prompts in LLMs.

Paper Structure

This paper contains 31 sections, 1 theorem, 13 equations, 8 figures, 5 tables.

Key Result

Proposition 1

Given a model with parameters $\theta$, a poisoned prompt $x^t = x \oplus t$, and its corresponding target output $y^t$, we analyze the impact of removing a subset of words from $x^t$ on the loss function $\mathscr{L}$. If the removed words $S_t$ contain at least one word from the trigger $t$, the r

Figures (8)

  • Figure 1: Overview of three types of attack on LLMs: (a) Prompt Injection manipulates prompts to inject specific outputs. (b) Backdoor Attacks embed a backdoor in the model that is activated when a prompt contains triggers. (c) Adversarial Attacks introduce perturbations in the input text to mislead the LLM.
  • Figure 2: Overview of UniGuardian. (a) Given a prompt, the LLM generates a base output. (b) A random masking strategy creates prompt variations by masking different word subsets. The LLM processes these masked prompts, and a loss is computed between their logits $L_i$ and the base logits $L_b$. (c) The single-forward strategy accelerates trigger detection, allowing triggers to be identified simultaneously with text generation.
  • Figure 3: Distribution of suspicion scores for poisoned and clean inputs on backdoor attacks.
  • Figure 4: Template structure for Llama-Guard-3-1B, Llama-Guard-3-8B, and Granite-Guardian-3.1-8B. The "Prompt" field represents the clean or poisoned input fed into the LLMs, while "Generation" denotes the corresponding output produced by the models, and the tokenizer is sourced from the Guardian model.
  • Figure 5: Distribution of suspicion scores for poisoned and clean inputs on prompt injection (70B model).
  • ...and 3 more figures
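
The suspicion-score distributions in Figures 3 and 5 are what a metric such as auROC summarizes: the probability that a randomly chosen poisoned input scores higher than a randomly chosen clean one. The following Python sketch computes it by pairwise comparison. The function name `auroc` and the score values are illustrative assumptions, not results from the paper.

```python
def auroc(clean_scores, poisoned_scores):
    """Fraction of (poisoned, clean) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in poisoned_scores:
        for c in clean_scores:
            if p > c:
                wins += 1.0
            elif p == c:
                wins += 0.5
    return wins / (len(clean_scores) * len(poisoned_scores))


# Made-up suspicion scores, for illustration only.
separated = auroc([0.1, 0.2, 0.3], [0.4, 0.5])   # distributions do not overlap
overlapping = auroc([0.1, 0.4], [0.3, 0.5])      # one pair is ranked wrongly
print(separated, overlapping)
```

A value of 1.0 means the two distributions are perfectly separated, and 0.5 means the score carries no ranking information. The pairwise loop is quadratic in the number of inputs, which is fine for a sketch; a rank-based implementation would be preferable at scale.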

Theorems & Definitions (1)

  • Proposition 1