Plato's Form: Toward Backdoor Defense-as-a-Service for LLMs with Prototype Representations

Chen Chen, Yuchen Sun, Jiaxin Gao, Yanwen Jia, Xueluan Gong, Qian Wang, Kwok-Yan Lam

TL;DR

ProtoPurify targets the practical need for scalable backdoor defenses in LLMs by learning a transferable backdoor prototype in weight space from simulated attacks. It locates the backdoor-affected layers by detecting a boundary layer, then applies targeted, controllable purification through SVD-based suppression of prototype-aligned components, which makes it suitable for Backdoor Defense-as-a-Service (BDaaS) deployment. Across two LLMs and multiple attack types, ProtoPurify often reduces the attack success rate (ASR) to below 10% while preserving clean-data accuracy (CDA) and remaining robust to adaptive attacks. The design emphasizes reusability, customizability, interpretability, and runtime efficiency, with the aim of scalable deployment in security-sensitive settings.

Abstract

Large language models (LLMs) are increasingly deployed in security-sensitive applications, yet remain vulnerable to backdoor attacks. However, existing backdoor defenses are difficult to operationalize for Backdoor Defense-as-a-Service (BDaaS), as they require unrealistic side information (e.g., downstream clean data, known triggers/targets, or task domain specifics), and lack reusable, scalable purification across diverse backdoored models. In this paper, we present ProtoPurify, a backdoor purification framework via parameter edits under minimal assumptions. ProtoPurify first builds a backdoor vector pool from clean and backdoored model pairs, aggregates vectors into candidate prototypes, and selects the most aligned candidate for the target model via similarity matching. ProtoPurify then identifies a boundary layer through layer-wise prototype alignment and performs targeted purification by suppressing prototype-aligned components in the affected layers, achieving fine-grained mitigation with minimal impact on benign utility. Designed as a BDaaS-ready primitive, ProtoPurify supports reusability, customizability, interpretability, and runtime efficiency. Experiments across various LLMs on both classification and generation tasks show that ProtoPurify consistently outperforms 6 representative defenses against 6 diverse attacks, including single-trigger, multi-trigger, and triggerless backdoor settings. ProtoPurify reduces the attack success rate (ASR) to below 10%, and even as low as 1.6% in some cases, while incurring less than a 3% drop in clean utility. ProtoPurify further demonstrates robustness against adaptive backdoor variants and stability on non-backdoored models.
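
The prototype construction and matching steps described in the abstract can be illustrated with a minimal sketch. It assumes that a backdoor vector is the flattened weight difference between a backdoored model and its clean counterpart, that candidates are aggregated by a plain mean, and that matching uses cosine similarity. The abstract does not specify these details, and every function name and toy value below is hypothetical.

```python
import math

def backdoor_vector(clean_weights, backdoored_weights):
    """Assumed definition: element-wise difference (backdoored - clean)."""
    return [b - c for c, b in zip(clean_weights, backdoored_weights)]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mean_prototype(vectors):
    """Assumed aggregation: coordinate-wise mean of a group of backdoor vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def select_prototype(candidates, target_vector):
    """Return the name of the candidate most aligned with the target, plus all scores."""
    scores = {name: cosine(proto, target_vector) for name, proto in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy data: a 4-parameter "model" and two groups of simulated attacks.
clean = [0.0, 0.0, 0.0, 0.0]
pool_a = [backdoor_vector(clean, w) for w in ([1.0, 0.1, 0.0, 0.0], [0.9, -0.1, 0.0, 0.0])]
pool_b = [backdoor_vector(clean, w) for w in ([0.0, 0.0, 1.0, 0.2], [0.0, 0.0, 0.8, -0.2])]
candidates = {"group_a": mean_prototype(pool_a), "group_b": mean_prototype(pool_b)}

# Assumed: the target model's weight offset from a known clean reference.
target_vector = [0.7, 0.05, 0.1, 0.0]
best, scores = select_prototype(candidates, target_vector)
print(best, scores)
```

In this toy setting the target offset points mostly along the first coordinate, so the `group_a` prototype receives the higher score. With real models the vectors would be per-layer weight tensors, not short lists.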

Paper Structure (23 sections, 18 equations, 6 figures, 11 tables, 4 algorithms)


Figures (6)

  • Figure 1: Cosine similarities among backdoor vectors (B vs B) and between backdoor and clean vectors (B vs C). The backdoor vectors are extracted from 5 trigger types and 4 different attacks.
  • Figure 2: An overview of ProtoPurify. In Stage I, we simulate diverse backdoor scenarios to extract backdoor vectors from clean vs. backdoored model weights. These vectors are aggregated into prototype vectors, from which the most aligned prototype is selected for the target backdoor model in Stage II. Stage III aims to detect a boundary layer via layer-wise prototype alignment. In Stage IV, we purify candidate layers by suppressing prototype-aligned components for each matrix.
  • Figure 3: Effect of Boundary Layer Detection.
  • Figure 4: Effect of purification strength $\alpha$.
  • Figure 5: Layer-wise alignment scores.
  • ...and 1 more figure
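
The Figure 2 caption describes Stage III (boundary layer detection via layer-wise prototype alignment) and Stage IV (suppressing prototype-aligned components), and Figure 4 refers to a purification strength $\alpha$. The sketch below illustrates those ideas under loud assumptions: the alignment score is taken to be the absolute cosine similarity per layer, the boundary is the first layer whose score reaches a hypothetical threshold, and all layers from the boundary onward are purified. The suppression step here is a plain vector projection scaled by `alpha`, used as a simplified stand-in for the SVD-based procedure the paper mentions; it is not the authors' method.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0

def alignment_scores(target_layers, prototype_layers):
    """Assumed score: |cosine| between each layer's offset and its prototype."""
    return [abs(cosine(t, p)) for t, p in zip(target_layers, prototype_layers)]

def detect_boundary(scores, threshold=0.5):
    """Hypothetical rule: index of the first layer with score >= threshold, else None."""
    for i, s in enumerate(scores):
        if s >= threshold:
            return i
    return None

def suppress(delta, prototype, alpha=1.0):
    """Subtract alpha times the projection of delta onto the prototype direction."""
    pp = dot(prototype, prototype)
    if pp == 0:
        return list(delta)
    coef = alpha * dot(delta, prototype) / pp
    return [d - coef * p for d, p in zip(delta, prototype)]

# Toy data: three layers with two parameters each.
prototype_layers = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
target_layers = [[0.1, 1.0], [0.9, 0.3], [1.0, 0.1]]

scores = alignment_scores(target_layers, prototype_layers)
boundary = detect_boundary(scores)

# Assumed: purify layers at or after the boundary, leave earlier layers untouched.
purified = [
    suppress(t, p, alpha=1.0) if boundary is not None and i >= boundary else list(t)
    for i, (t, p) in enumerate(zip(target_layers, prototype_layers))
]
print(scores, boundary, purified)
```

With `alpha=1.0` the prototype-aligned component of each purified layer is removed entirely, while a smaller `alpha` removes only part of it. That is one way to read the controllable strength in Figure 4, though the paper's exact parameterization is not given in this summary.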