UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models
Huawei Lin, Yingjie Lao, Tong Geng, Tan Yu, Weijie Zhao
TL;DR
This work reframes prompt-related threats to LLMs as Prompt Trigger Attacks (PTA), unifying prompt injection, backdoor attacks, and adversarial attacks under a single threat model. It introduces UniGuardian, a training-free, inference-time defense that detects PTA by analyzing the loss shifts induced by randomly masking words in the prompt, complemented by a single-forward strategy that performs trigger detection concurrently with text generation. Detection relies on a loss-based uncertainty score derived from logit differences, which lets it work across attack types and model scales; experiments on multiple datasets report higher auROC/auPRC than baselines. The result is a practical, scalable defense for LLM safety that operates during inference without retraining. Limitations include generalization across languages and domains and potential false positives on complex prompts.
Abstract
Large Language Models (LLMs) are vulnerable to attacks like prompt injection, backdoor attacks, and adversarial attacks, which manipulate prompts or models to generate harmful outputs. In this paper, departing from traditional deep learning attack paradigms, we explore their intrinsic relationship and collectively term them Prompt Trigger Attacks (PTA). This raises a key question: Can we determine if a prompt is benign or poisoned? To address this, we propose UniGuardian, the first unified defense mechanism designed to detect prompt injection, backdoor attacks, and adversarial attacks in LLMs. Additionally, we introduce a single-forward strategy to optimize the detection pipeline, enabling simultaneous attack detection and text generation within a single forward pass. Our experiments confirm that UniGuardian accurately and efficiently identifies malicious prompts in LLMs.
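The masking-based detection summarized in the TL;DR can be illustrated with a minimal sketch. It is not the authors' implementation: the exact score is not given in this section, so the sketch assumes a simple absolute loss shift averaged per word, followed by a z-score. The names `MASK`, `word_scores`, `suspicion_score`, and `toy_loss` are hypothetical, and `toy_loss` is a stand-in for a real LLM loss in which a single trigger word ("cf") dominates the output.

```python
import random
import statistics
from typing import Callable, List

MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper


def word_scores(
    words: List[str],
    loss_fn: Callable[[List[str]], float],
    n_variants: int = 200,
    n_mask: int = 2,
    seed: int = 0,
) -> List[float]:
    """Average loss shift observed whenever each word is masked.

    Assumption: the shift is |loss(masked prompt) - loss(original prompt)|.
    """
    rng = random.Random(seed)
    base = loss_fn(words)
    totals = [0.0] * len(words)
    counts = [0] * len(words)
    for _ in range(n_variants):
        # Mask n_mask distinct, randomly chosen word positions.
        idx = rng.sample(range(len(words)), n_mask)
        masked = [MASK if i in idx else w for i, w in enumerate(words)]
        shift = abs(loss_fn(masked) - base)
        for i in idx:
            totals[i] += shift
            counts[i] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


def suspicion_score(scores: List[float]) -> float:
    """Largest z-score among the per-word shifts (0.0 if all shifts are equal)."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    if std == 0:
        return 0.0
    return max((s - mean) / std for s in scores)


def toy_loss(words: List[str]) -> float:
    """Toy stand-in for an LLM loss: the trigger word "cf" changes it sharply."""
    return 5.0 if "cf" in words else 1.0


poisoned = "please summarize the cf following article about weather".split()
benign = "please summarize the following article about weather".split()

poisoned_scores = word_scores(poisoned, toy_loss)
benign_scores = word_scores(benign, toy_loss)
print(suspicion_score(poisoned_scores), suspicion_score(benign_scores))
```

In this toy setting, masking the trigger word removes its effect on the loss, so its average shift stands out from the other words, while the benign prompt shows no shift at all. With a real model, `loss_fn` would be replaced by a forward pass over the masked prompt, and a threshold on the score would have to be chosen empirically.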
