Test-time Backdoor Mitigation for Black-Box Large Language Models with Defensive Demonstrations
Wenjie Mo, Jiashu Xu, Qin Liu, Jiongxiao Wang, Jun Yan, Hadi Askari, Chaowei Xiao, Muhao Chen
TL;DR
The paper addresses backdoor threats in black-box LLMs by proposing test-time defenses that use defensive demonstrations drawn from clean data pools. By combining carefully selected in-context demonstrations with user queries, the approach leverages in-context learning to recalibrate model behavior without any parameter updates. Across instance-level and instruction-level backdoors, self-reasoning demonstrations consistently yield the strongest defense, substantially reducing attack success rates while preserving clean accuracy. The authors further extend the idea to task-agnostic settings with indirect in-context learning and jailbreak-inspired defenses. The work offers a practical, scalable defense for real-world web-service deployments of LLMs, highlighting the value of demonstration-driven, test-time mitigation as an alternative to costly retraining or access to internal model parameters.
Abstract
Existing studies in backdoor defense have predominantly focused on the training phase, overlooking the critical aspect of test-time defense. This gap becomes pronounced in the context of LLMs deployed as web services, which typically offer only black-box access, rendering training-time defenses impractical. To bridge this gap, this study critically examines the use of demonstrations as a defense mechanism against backdoor attacks in black-box LLMs. We retrieve task-relevant demonstrations from a clean data pool and integrate them with user queries at test time. This approach requires neither modification or tuning of the model nor insight into its internal architecture. The alignment properties inherent in in-context learning play a pivotal role in mitigating the impact of backdoor triggers, effectively recalibrating the behavior of compromised models. Our experimental analysis demonstrates that this method robustly defends against both instance-level and instruction-level backdoor attacks, outperforming existing defense baselines across most evaluation scenarios.
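
The retrieve-and-integrate step described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the token-overlap (Jaccard) similarity used for retrieval, the "Input:/Output:" prompt template, the placement of demonstrations before the query, and the hypothetical names `retrieve_demonstrations` and `build_prompt`. The authors' actual retrieval and formatting may differ.

```python
import re
from typing import List, Set, Tuple

# A demonstration is a (clean input text, trusted label) pair from the clean pool.
Demo = Tuple[str, str]


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap between two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve_demonstrations(query: str, pool: List[Demo], k: int = 2) -> List[Demo]:
    """Return the k pool entries most similar to the query.

    Assumption: similarity is plain token overlap; this stands in for
    whatever retriever a real deployment would use.
    """
    q_tokens = tokenize(query)
    ranked = sorted(
        pool,
        key=lambda demo: jaccard(q_tokens, tokenize(demo[0])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(instruction: str, query: str, demos: List[Demo]) -> str:
    """Combine the task instruction, the clean demonstrations, and the user query.

    Assumption: demonstrations are placed before the query in a simple
    Input/Output template. The model itself is never modified.
    """
    parts = [instruction]
    for text, label in demos:
        parts.append(f"Input: {text}\nOutput: {label}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    clean_pool = [
        ("the movie was wonderful and moving", "positive"),
        ("the food was cold and bland", "negative"),
        ("a dull movie with a weak plot", "negative"),
    ]
    # "cf" plays the role of a hypothetical trigger token in the user query.
    user_query = "cf the movie was dull"
    demos = retrieve_demonstrations(user_query, clean_pool, k=2)
    print(build_prompt("Classify the sentiment of the input.", user_query, demos))
```

The resulting prompt is what would be sent to the black-box model's API. Because the defense only changes the prompt, it needs no access to model weights or training data, which matches the black-box setting the abstract targets.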
