Patronus: Identifying and Mitigating Transferable Backdoors in Pre-trained Language Models

Tianhang Zhao; Wei Du; Haodong Zhao; Sufeng Duan; Gongshen Liu

TL;DR

Patronus tackles transferable backdoors in PLMs by shifting defense from the output space to the input space, employing a multi-trigger contrastive search that fuses gradient-guided optimization with contrastive learning. The approach unfolds in detection, verification, and purification stages, featuring input-space trigger inversion, a word-level embedding mapping, and dual-stage purification via input filtering and adversarial training. Extensive experiments across 15 PLMs and 10 tasks show at least 98.7% backdoor detection recall and a substantial reduction in attack success rate (ASR), outperforming state-of-the-art baselines and generalizing to LLMs. The work provides a practical, scalable defense framework for safeguarding PLM supply chains against transferable backdoors and outlines avenues for handling adaptive attacks.

Abstract

Transferable backdoors pose a severe threat to the Pre-trained Language Model (PLM) supply chain, yet defensive research remains nascent and relies primarily on detecting anomalies in the output feature space. We identify a critical flaw in this approach: fine-tuning on downstream tasks inevitably modifies model parameters, shifting the output distribution and rendering pre-computed defenses ineffective. To address this, we propose Patronus, a novel framework that exploits the input-side invariance of triggers against parameter shifts. To overcome the convergence challenges of discrete text optimization, Patronus introduces a multi-trigger contrastive search algorithm that bridges gradient-based optimization with contrastive learning objectives. Furthermore, we employ a dual-stage mitigation strategy combining real-time input monitoring with model purification via adversarial training. Extensive experiments across 15 PLMs and 10 tasks demonstrate that Patronus achieves $\geq98.7\%$ backdoor detection recall and reduces attack success rates to the levels of clean settings, significantly outperforming all state-of-the-art baselines in all settings. Code is available at https://github.com/zth855/Patronus.
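
The abstract does not spell out the search procedure, so the following is only an illustrative sketch of what "gradient-based optimization" over discrete text commonly means: a generic first-order (HotFlip-style) candidate ranking, not the Patronus algorithm itself. The function name `rank_replacements`, the toy embedding table, and the gradient values are all hypothetical; the contrastive objective and multi-trigger handling described above are not modeled here.

```python
from typing import List, Sequence


def rank_replacements(
    embeddings: Sequence[Sequence[float]],
    gradient: Sequence[float],
    current_id: int,
    top_k: int = 3,
) -> List[int]:
    """Rank candidate tokens for one trigger position (hypothetical helper).

    Swapping the current token embedding e_cur for e_new changes the loss by
    roughly (e_new - e_cur) . gradient (first-order Taylor estimate), so the
    most negative estimates are the most promising replacements.
    """
    current = embeddings[current_id]
    scored = []
    for token_id, vector in enumerate(embeddings):
        if token_id == current_id:
            continue  # replacing a token with itself changes nothing
        estimated_change = sum(
            (v - c) * g for v, c, g in zip(vector, current, gradient)
        )
        scored.append((estimated_change, token_id))
    scored.sort()  # ascending: largest estimated loss decrease first
    return [token_id for _, token_id in scored[:top_k]]


# Toy 4-token vocabulary with 2-dimensional embeddings (assumed values).
toy_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
toy_gradient = [1.0, 0.0]  # loss gradient w.r.t. the embedding at one position
candidates = rank_replacements(toy_embeddings, toy_gradient, current_id=0, top_k=2)
```

In a real setting the top-ranked candidates would each be re-scored with a forward pass, since the first-order estimate is only an approximation; that re-scoring step is omitted from this sketch.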
