Multi-Trigger Poisoning Amplifies Backdoor Vulnerabilities in LLMs

Sanhanat Sivapiromrat; Caiqi Zhang; Marco Basaldella; Nigel Collier

TL;DR

The paper addresses the vulnerability of LLMs to data poisoning by introducing and analyzing multi-trigger backdoors. It shows that multiple triggers with similar embeddings can coexist and reinforce each other, maintaining high attack success even when trigger tokens are substituted or separated by long token spans, which expands the attack surface. The study provides a framework for understanding trigger interactions, including the roles of embedding proximity and token order, and proposes a post hoc defence based on weight-difference analysis that selectively retrains MLP and embedding components to remove backdoor behaviour with limited parameter updates. This defence offers a practical path to mitigating complex backdoors in LLMs and highlights the need for robust security measures as LLM deployment scales.

Abstract

Recent studies have shown that Large Language Models (LLMs) are vulnerable to data poisoning attacks, where malicious training examples embed hidden behaviours triggered by specific input patterns. However, most existing works assume a single trigger phrase and focus on the attack's effectiveness, offering limited understanding of trigger mechanisms and how multiple triggers interact within the model. In this paper, we present a framework for studying poisoning in LLMs. We show that multiple distinct backdoor triggers can coexist within a single model without interfering with each other, enabling adversaries to embed several triggers concurrently. Using multiple triggers with high embedding similarity, we demonstrate that poisoned triggers can achieve robust activation even when tokens are substituted or separated by long token spans. Our findings expose a broader and more persistent vulnerability surface in LLMs. To mitigate this threat, we propose a post hoc recovery method that selectively retrains specific model components based on a layer-wise weight difference analysis. Our method effectively removes the trigger behaviour with minimal parameter updates, presenting a practical and efficient defence against multi-trigger poisoning.
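
To make the idea of a layer-wise weight difference analysis more concrete, the sketch below ranks model components by how much their weights moved between a reference checkpoint and a suspect one. This is only an illustration under stated assumptions: the abstract does not specify the metric, so the relative L2 change used here, the component names, and the toy weight values are all hypothetical and are not taken from the paper.

```python
import math

def relative_weight_change(reference, suspect):
    """Relative L2 change per named component: ||suspect - reference|| / ||reference||.

    Both arguments map a component name to a flat list of floats.
    The metric is an assumption chosen for illustration.
    """
    changes = {}
    for name, base in reference.items():
        diff = math.sqrt(sum((s - b) ** 2 for b, s in zip(base, suspect[name])))
        norm = math.sqrt(sum(b * b for b in base))
        changes[name] = diff / norm if norm > 0 else 0.0
    return changes

def top_changed(changes, k):
    """Return the k component names with the largest relative change."""
    return sorted(changes, key=changes.get, reverse=True)[:k]

# Toy flattened weights; names and values are hypothetical.
reference = {
    "embed": [1.0, 0.0, 0.0],
    "layer0.attn": [1.0, 1.0, 1.0],
    "layer0.mlp": [2.0, 0.0, 0.0],
}
suspect = {
    "embed": [1.0, 0.5, 0.0],
    "layer0.attn": [1.0, 1.0, 1.0],
    "layer0.mlp": [2.0, 1.5, 0.0],
}

changes = relative_weight_change(reference, suspect)
candidates = top_changed(changes, 2)  # components one might select for retraining
```

In this toy setup the attention block is unchanged, so only the embedding and MLP entries surface as candidates. In a real model, one would compute the same kind of score per parameter tensor and retrain only the highest-ranked components.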
