Weight space Detection of Backdoors in LoRA Adapters

David Puertolas Merenciano, Ekaterina Vasyagina, Raghav Dixit, Kevin Zhu, Ruizhe Li, Javier Ferrando, Maheep Chaudhary

TL;DR

This work detects poisoned LoRA adapters by analyzing their weight matrices directly, without running the model, which makes the method data-agnostic. It achieves 97% detection accuracy with less than 2% false positives.

Abstract

LoRA adapters let users fine-tune large language models (LLMs) efficiently. However, because adapters are shared through open repositories such as the Hugging Face Hub \citep{huggingface_hub_docs}, they are vulnerable to backdoor attacks. Current detection methods require running the model on test inputs, which makes them impractical for screening thousands of adapters when the backdoor trigger is unknown. We detect poisoned adapters by analyzing their weight matrices directly, without running the model, which makes our method data-agnostic. Our method extracts simple statistics of the singular values (how concentrated they are, their entropy, and the shape of their distribution) and flags adapters that deviate from normal patterns. We evaluate the method on 500 LoRA adapters for Llama-3.2-3B (400 clean and 100 poisoned) trained on instruction and reasoning datasets: Alpaca, Dolly, GSM8K, ARC-Challenge, SQuADv2, NaturalQuestions, HumanEval, and GLUE. We achieve 97% detection accuracy with less than 2% false positives.

Paper Structure (18 sections, 5 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Overview of the backdoor detection pipeline. Given a LoRA adapter, we extract its weight matrices and compute $\Delta W = BA$. We sum the updates across attention projections and perform SVD to obtain the singular values. From these, we compute five spectral metrics: leading singular value, Frobenius norm, energy concentration, spectral entropy, and kurtosis. Each metric is z-score normalized against a reference bank of benign adapters. A logistic regression classifier combines the scores and flags adapters exceeding threshold $\tau$ as backdoored, all without model execution.
  • Figure 2: Score distributions for benign (green) and poisoned (red) adapters on the held-out test set. The threshold $\tau = 0.718$ achieves 97% detection accuracy with clear separation. See Appendix \ref{sec:calibration} for the calibration score distribution.
  • Figure 3: Layer Choice Support Data
  • Figure 4: Calibration score distribution for benign (green) and poisoned (red) adapters. The threshold $\tau = 0.718$ was selected to maximize poison detection rate.
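
The pipeline described in the Figure 1 caption can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function names (`spectral_metrics`, `zscore`, `backdoor_score`), the exact metric formulas (energy concentration as the leading singular value's share of squared spectral mass, entropy over normalized squared singular values, kurtosis of the singular values), the toy matrix shapes, and the classifier weights are all assumptions. Only the five metric names, the z-score step, the logistic combination, and the threshold $\tau = 0.718$ come from the captions above. The sketch also assumes every attention projection has the same shape so that the updates can be summed.

```python
import numpy as np

def spectral_metrics(lora_pairs):
    """Five spectral metrics for one adapter.

    lora_pairs: list of (B, A) with B of shape (d_out, r) and A of shape (r, d_in).
    Assumption: all projections share d_out and d_in so the updates can be summed.
    """
    delta_w = sum(B @ A for B, A in lora_pairs)      # summed update, Delta W = B A
    s = np.linalg.svd(delta_w, compute_uv=False)     # singular values, descending
    energy = s ** 2
    total = energy.sum()
    p = energy / total                               # normalized spectral mass
    p_nz = p[p > 0]                                  # avoid log(0)
    centered = s - s.mean()
    var = (centered ** 2).mean()
    return np.array([
        s[0],                                        # leading singular value
        np.sqrt(total),                              # Frobenius norm
        energy[0] / total,                           # energy concentration (assumed form)
        -(p_nz * np.log(p_nz)).sum(),                # spectral entropy (assumed form)
        (centered ** 4).mean() / var ** 2,           # kurtosis (assumed: non-excess)
    ])

def zscore(features, benign_bank):
    """Normalize metrics against a reference bank of benign adapters (rows)."""
    mu = benign_bank.mean(axis=0)
    sigma = benign_bank.std(axis=0) + 1e-12          # guard against zero spread
    return (features - mu) / sigma

def backdoor_score(z, weights, bias):
    """Logistic combination of z-scored metrics; weights and bias are hypothetical."""
    return 1.0 / (1.0 + np.exp(-(z @ weights + bias)))

# Toy usage with random matrices standing in for real adapter weights.
rng = np.random.default_rng(0)

def random_adapter(d_out=32, d_in=32, r=4, n_proj=4, scale=1.0):
    return [(scale * rng.normal(size=(d_out, r)), rng.normal(size=(r, d_in)))
            for _ in range(n_proj)]

benign_bank = np.stack([spectral_metrics(random_adapter()) for _ in range(20)])
candidate = spectral_metrics(random_adapter(scale=3.0))
z = zscore(candidate, benign_bank)

weights = np.ones(5)      # hypothetical; a real classifier would be fit on labeled adapters
bias = 0.0                # hypothetical
TAU = 0.718               # threshold reported in the Figure 2 and Figure 4 captions
score = backdoor_score(z, weights, bias)
flagged = bool(score > TAU)
```

Because nothing here touches the base model or any input data, screening cost is dominated by one SVD per adapter. In practice the weights and bias would be fit on a labeled calibration set rather than fixed by hand.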