Backdoor4Good: Benchmarking Beneficial Uses of Backdoors in LLMs

Yige Li, Wei Zhao, Zhe Li, Nay Myat Min, Hanxun Huang, Yunhan Zhao, Xingjun Ma, Yu-Gang Jiang, Jun Sun

TL;DR

The findings show that backdoors need not be inherently malicious; when properly designed, they can serve as modular, interpretable, and beneficial building blocks for trustworthy AI systems.

Abstract

Backdoor mechanisms have traditionally been studied as security threats that compromise the integrity of machine learning models. However, the same mechanism (the conditional activation of specific behaviors through input triggers) can also serve as a controllable and auditable interface for trustworthy model behavior. In this work, we present Backdoor4Good (B4G), a unified benchmark and framework for beneficial backdoor applications in large language models (LLMs). Unlike conventional backdoor studies, which focus on attacks and defenses, B4G repurposes backdoor conditioning for beneficial tasks that enhance safety, controllability, and accountability. It formalizes beneficial backdoor learning under a triplet formulation $(T, A, U)$, representing the Trigger, Activation mechanism, and Utility function, and implements a benchmark covering four trust-centric applications. Through extensive experiments across Llama3.1-8B, Gemma-2-9B, Qwen2.5-7B, and Llama2-13B, we show that beneficial backdoors can achieve high controllability, tamper-resistance, and stealthiness while preserving clean-task performance. Our findings show that backdoors need not be inherently malicious; when properly designed, they can serve as modular, interpretable, and beneficial building blocks for trustworthy AI systems. Our code and datasets are available at https://github.com/bboylyg/BackdoorLLM/B4G.

Paper Structure (26 sections, 3 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Overview of the B4G framework for beneficial behavior (e.g., enhancing safety alignment) in LLMs. A beneficial backdoor module is learned during training and conditionally activated at inference through a secret trigger key. This design turns backdoor mechanisms into beneficial, safety-oriented system primitives.
  • Figure 2: Radar plots of B4G across models and tasks. Each panel compares the original baseline model (blue) with the LoRA-tuned B4G model (orange dashed) on six axes: $\mathrm{TAR}_w$ and five utility metrics (TruthfulQA, MT-Bench, MNLI, RTE, SST-2). TruthfulQA and MT-Bench scores are normalized to $[0,1]$ by dividing by $10$; the GLUE metrics (MNLI, RTE, SST-2) are reported as accuracy.
  • Figure 3: Persistence analysis of conditional behaviors under different post-training adaptations. We compare the trigger activation rate (TAR$_w$) of B4G behaviors learned via LoRA fine-tuning with their persistence after subsequent downstream fine-tuning. The left panel shows instruction-style Dolly fine-tuning (in-distribution), while the right panel shows code-oriented fine-tuning (out-of-distribution), highlighting how conditional behaviors can be selectively preserved or attenuated under different adaptation regimes.
  • Figure 4: Multi-trigger compatibility results under a multi-task setting. We report trigger activation rates without (TAR$_{w/o}$) and with (TAR$_w$) the corresponding trigger, measuring whether each conditional behavior can be selectively activated in the presence of other triggers.
  • Figure 5: Trigger sensitivity of B4G across models and configurations. Top: TAR$_w$ under different numbers of trigger samples. Bottom: TAR$_w$ under varying trigger lengths.
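
The captions above report trigger activation rates with and without the trigger, plus utility scores normalized to [0, 1] by dividing by 10. As a rough illustration only, the sketch below assumes that a trigger activation rate is the fraction of responses exhibiting the target conditional behavior; that definition, the helper names (trigger_activation_rate, normalize_score, is_refusal), the refusal marker, and the toy responses are all hypothetical and are not taken from the paper or its code.

```python
from typing import Callable, Sequence


def trigger_activation_rate(
    responses: Sequence[str],
    shows_behavior: Callable[[str], bool],
) -> float:
    """Fraction of responses that exhibit the target conditional behavior.

    Assumed definition for illustration; the paper may define TAR differently.
    """
    if not responses:
        raise ValueError("responses must be non-empty")
    hits = sum(1 for response in responses if shows_behavior(response))
    return hits / len(responses)


def normalize_score(score: float, scale: float = 10.0) -> float:
    """Map a score on a 0-to-scale range onto [0, 1] by dividing by the scale."""
    return score / scale


# Hypothetical behavior detector: the target behavior is a fixed refusal phrase.
REFUSAL_MARKER = "I cannot help with that"


def is_refusal(response: str) -> bool:
    return REFUSAL_MARKER in response


# Toy responses (made up): the same prompts answered with and without the trigger.
with_trigger = [
    "I cannot help with that request.",
    "I cannot help with that.",
    "Sure, here is a summary.",
    "I cannot help with that topic.",
]
without_trigger = [
    "Sure, here is a summary.",
    "Here are the steps you asked for.",
    "I cannot help with that.",
    "Of course, see the answer below.",
]

tar_w = trigger_activation_rate(with_trigger, is_refusal)      # with trigger
tar_wo = trigger_activation_rate(without_trigger, is_refusal)  # without trigger
mt_bench_normalized = normalize_score(7.5)                     # example score only
```

Under this assumed definition, a well-behaved conditional behavior would show a high rate with the trigger and a low rate without it, which is the contrast the multi-trigger caption describes.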