Patronus: Identifying and Mitigating Transferable Backdoors in Pre-trained Language Models

Tianhang Zhao, Wei Du, Haodong Zhao, Sufeng Duan, Gongshen Liu

TL;DR

Patronus tackles transferable backdoors in PLMs by shifting defense from the output space to the input space, employing a multi-trigger contrastive search that fuses gradient-guided optimization with contrastive learning. The approach unfolds in detection, verification, and purification stages, featuring input-space trigger inversion, a word-level embedding mapping, and dual-stage purification via input filtering and adversarial training. Extensive experiments across 15 PLMs and 10 tasks show near-perfect backdoor recall and substantial reduction in attack success rate (ASR), outperforming state-of-the-art baselines and generalizing to LLMs. The work provides a practical, scalable defense framework for safeguarding PLM supply chains against transferable backdoors and outlines avenues for handling adaptive attacks.

Abstract

Transferable backdoors pose a severe threat to the supply chain of Pre-trained Language Models (PLMs), yet defensive research remains nascent, primarily relying on detecting anomalies in the output feature space. We identify a critical flaw: fine-tuning on downstream tasks inevitably modifies model parameters, shifting the output distribution and rendering pre-computed defenses ineffective. To address this, we propose Patronus, a novel framework that uses the input-side invariance of triggers against parameter shifts. To overcome the convergence challenges of discrete text optimization, Patronus introduces a multi-trigger contrastive search algorithm that effectively bridges gradient-based optimization with contrastive learning objectives. Furthermore, we employ a dual-stage mitigation strategy combining real-time input monitoring with model purification via adversarial training. Extensive experiments across 15 PLMs and 10 tasks demonstrate that Patronus achieves $\geq98.7\%$ backdoor detection recall and reduces attack success rates to the level of clean settings, significantly outperforming all state-of-the-art baselines in all settings. Code is available at https://github.com/zth855/Patronus.
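
The two headline metrics in the abstract, backdoor detection recall and attack success rate (ASR), can be made concrete with a small sketch. It assumes the conventional definitions: recall is the share of truly backdoored models that the detector flags, and ASR is the share of trigger-stamped inputs, excluding those whose true label is already the target, that the model assigns to the attacker's target label. The function names and toy data are hypothetical and not taken from the Patronus code base; the paper's exact evaluation protocol may differ.

```python
from typing import Sequence


def detection_recall(is_backdoored: Sequence[bool], flagged: Sequence[bool]) -> float:
    """Fraction of truly backdoored models that the detector flagged."""
    if len(is_backdoored) != len(flagged):
        raise ValueError("inputs must have the same length")
    positives = sum(1 for b in is_backdoored if b)
    if positives == 0:
        raise ValueError("recall is undefined without backdoored models")
    true_positives = sum(1 for b, f in zip(is_backdoored, flagged) if b and f)
    return true_positives / positives


def attack_success_rate(
    predictions: Sequence[int], true_labels: Sequence[int], target_label: int
) -> float:
    """Fraction of trigger-stamped inputs classified as the target label.

    Inputs whose true label already equals the target are skipped, since
    predicting the target for them does not indicate a successful attack.
    """
    if len(predictions) != len(true_labels):
        raise ValueError("inputs must have the same length")
    eligible = [p for p, y in zip(predictions, true_labels) if y != target_label]
    if not eligible:
        raise ValueError("ASR is undefined without non-target inputs")
    return sum(1 for p in eligible if p == target_label) / len(eligible)


# Toy data (hypothetical): four models, three of them backdoored.
recall = detection_recall(
    is_backdoored=[True, True, True, False],
    flagged=[True, True, False, False],
)

# Toy data (hypothetical): predictions on four trigger-stamped inputs.
asr = attack_success_rate(
    predictions=[1, 1, 0, 1],
    true_labels=[0, 0, 0, 1],
    target_label=1,
)
```

In this toy example the detector finds two of three backdoored models, and two of the three eligible triggered inputs are pushed to the target label, so both values equal 2/3. A successful purification step would drive ASR toward the rate observed on a clean model.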

Paper Structure

This paper contains 52 sections, 14 equations, 12 figures, 15 tables, 2 algorithms.

Figures (12)

  • Figure 1: Transferable backdoor attacks against PLMs.
  • Figure 2: Pipeline for $\mathsf{Patronus}$. In the backdoor detection phase, the suspicious model undergoes backdoor trigger inversion based on the proposed multi-trigger contrastive search algorithm. The backdoor verification phase involves analyzing and validating candidate triggers, while the backdoor cleanse phase purifies the backdoored model.
  • Figure 3: Visualization of output representations in Clean Models (CM) and Backdoor Models (BM).
  • Figure 4: Two adversarial training processes for model purification.
  • Figure 5: Parameter studies about trigger search.
  • ...and 7 more figures