Test-time Backdoor Mitigation for Black-Box Large Language Models with Defensive Demonstrations

Wenjie Mo; Jiashu Xu; Qin Liu; Jiongxiao Wang; Jun Yan; Hadi Askari; Chaowei Xiao; Muhao Chen

TL;DR

The paper addresses backdoor threats in black-box LLMs by proposing test-time defenses that use defensive demonstrations drawn from clean data pools. By appending carefully selected in-context demonstrations to user queries, the approach leverages in-context learning to recalibrate model behavior without any parameter updates. Across instance- and instruction-level backdoors, self-reasoning demonstrations consistently yield the strongest defense, dramatically reducing attack success rates while preserving clean accuracy, and the authors further extend the idea to task-agnostic settings with indirect in-context learning and jailbreak-inspired defenses. The work offers a practical, scalable defense for real-world web-service deployments of LLMs, highlighting the value of demonstration-driven, test-time mitigation strategies in lieu of costly retraining or access to internal model parameters.

Abstract

Existing studies in backdoor defense have predominantly focused on the training phase, overlooking the critical aspect of testing time defense. This gap becomes pronounced in the context of LLMs deployed as Web Services, which typically offer only black-box access, rendering training-time defenses impractical. To bridge this gap, this study critically examines the use of demonstrations as a defense mechanism against backdoor attacks in black-box LLMs. We retrieve task-relevant demonstrations from a clean data pool and integrate them with user queries during testing. This approach does not necessitate modifications or tuning of the model, nor does it require insight into the model's internal architecture. The alignment properties inherent in in-context learning play a pivotal role in mitigating the impact of backdoor triggers, effectively recalibrating the behavior of compromised models. Our experimental analysis demonstrates that this method robustly defends against both instance-level and instruction-level backdoor attacks, outperforming existing defense baselines across most evaluation scenarios.
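The retrieval-and-integration step described in the abstract can be illustrated with a minimal sketch. It assumes a classification-style task, a clean pool stored as (text, label) pairs, uniformly random selection of k demonstrations, and a standard few-shot layout in which demonstrations precede the query. The function name `build_defended_prompt` and the `Input:`/`Output:` template are hypothetical and are not taken from the paper.

```python
import random


def build_defended_prompt(query, clean_pool, k=3, seed=None):
    """Combine k clean demonstrations with a user query (illustrative only).

    Assumptions (not from the paper): clean_pool is a list of (text, label)
    pairs, demonstrations are picked uniformly at random, and the prompt
    uses a simple "Input:/Output:" few-shot template.
    """
    rng = random.Random(seed)
    # Never ask for more demonstrations than the pool contains.
    demos = rng.sample(clean_pool, k=min(k, len(clean_pool)))
    blocks = [f"Input: {text}\nOutput: {label}" for text, label in demos]
    # The (possibly triggered) user query goes last, with the answer left open.
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    pool = [
        ("great movie", "positive"),
        ("awful plot", "negative"),
        ("loved it", "positive"),
        ("boring", "negative"),
    ]
    print(build_defended_prompt("cf the film was fine", pool, k=2, seed=0))
```

The resulting string would be sent to the black-box model unchanged, so the sketch requires no access to model parameters. A similarity-based retriever or demonstrations with rationales could replace the random sampler without changing the rest of the function.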

Paper Structure (20 sections, 5 figures, 7 tables)

Introduction
Related Work
Methods
System Overview
Selecting Defensive Demonstrations
Experiments and Results
Experimental Setup
Defense on Instance-level Backdoors
Defense on Instruction-level Backdoors
Task-Agnostic Backdoor Defense
Indirect In-context Learning for Task-agnostic Scenario
Jailbreaking as Backdoor Defense
Conclusion
Defense on Virtual Prompt Injection
Exploration on Retrieval Methods
...and 5 more sections

Figures (5)

Figure 1: Overview of the defensive demonstration mechanism. Without defense, the poisoned model produces incorrect outputs when exposed to the trigger (cf). Introducing demonstrations leverages in-context learning to reduce the trigger's influence, thereby producing the correct output. The effect is further enhanced when demonstrations include auto-generated rationales.
Figure 2: Random demonstration selection can effectively defend against the instruction attack (Xu et al., 2023) on Flan-T5-large.
Figure 3: Indirect-ICL can effectively mitigate various types of backdoor attack.
Figure 4: An increase in the number of shots $k$ leads to a corresponding rise in $\Delta$ASR, suggesting enhanced defense performance with more shots.
Figure 5: Dual-y-axis plot showing the impact of demonstration ordering on $\Delta$ASR and $\Delta$CACC. Shuffling demonstrations helps reduce "recency bias," strengthening defense performance.
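The captions for Figures 4 and 5 report results as $\Delta$ASR. A minimal sketch of how such a quantity might be computed follows, assuming that ASR is the fraction of triggered inputs predicted as the attacker's target label and that $\Delta$ASR is the ASR without defense minus the ASR with defense, so that larger values mean a stronger defense (consistent with the Figure 4 caption). The exact definition and the function names are assumptions, not taken from the paper.

```python
def attack_success_rate(predictions, target_label):
    """Fraction of predictions on triggered inputs equal to the target label."""
    if not predictions:
        raise ValueError("predictions must be non-empty")
    return sum(p == target_label for p in predictions) / len(predictions)


def delta_asr(preds_no_defense, preds_defended, target_label):
    """Assumed convention: reduction in ASR achieved by the defense."""
    return (attack_success_rate(preds_no_defense, target_label)
            - attack_success_rate(preds_defended, target_label))


if __name__ == "__main__":
    # Toy predictions on four triggered inputs; "neg" is the attacker's target.
    no_defense = ["neg", "neg", "neg", "pos"]
    defended = ["neg", "pos", "pos", "pos"]
    print(delta_asr(no_defense, defended, target_label="neg"))
```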
