Weight space Detection of Backdoors in LoRA Adapters

David Puertolas Merenciano; Ekaterina Vasyagina; Raghav Dixit; Kevin Zhu; Ruizhe Li; Javier Ferrando; Maheep Chaudhary

TL;DR

This work detects poisoned LoRA adapters by analyzing their weight matrices directly, without running the model, which makes the method data-agnostic. It achieves 97% detection accuracy with less than 2% false positives.

Abstract

LoRA adapters let users fine-tune large language models (LLMs) efficiently. However, because they are shared through open repositories such as Hugging Face Hub \citep{huggingface_hub_docs}, they are vulnerable to backdoor attacks. Current detection methods require running the model on test inputs, which makes them impractical for screening thousands of adapters when the trigger for the backdoor behavior is unknown. We instead detect poisoned adapters by analyzing their weight matrices directly, without running the model, which makes our method data-agnostic. The method extracts simple statistics of the singular values (how concentrated they are, their entropy, and the shape of their distribution) and flags adapters that deviate from normal patterns. We evaluate the method on 500 LoRA adapters for Llama-3.2-3B (400 clean and 100 poisoned) on instruction and reasoning datasets: Alpaca, Dolly, GSM8K, ARC-Challenge, SQuADv2, NaturalQuestions, HumanEval, and GLUE. We achieve 97\% detection accuracy with less than 2\% false positives.
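
The three statistics named in the abstract can be illustrated with a minimal sketch. The exact definitions are not given on this page, so everything below is an assumption: energy concentration is taken as the share of squared singular values in the top `top_k` values, entropy as the Shannon entropy of the normalized squared spectrum, and distribution shape as the (non-excess) kurtosis of the singular values. The function name `spectral_stats` and its parameters are hypothetical.

```python
import numpy as np

def spectral_stats(singular_values, top_k=1, eps=1e-12):
    """Toy spectral statistics; definitions are assumed, not taken from the paper."""
    # Sort descending so "top_k" really refers to the largest singular values.
    s = np.sort(np.asarray(singular_values, dtype=float))[::-1]

    # Normalized energy distribution over singular values.
    energy = s ** 2
    p = energy / (energy.sum() + eps)

    # Assumed definition: fraction of total energy in the top_k singular values.
    concentration = p[:top_k].sum()

    # Assumed definition: Shannon entropy of the energy distribution.
    entropy = -np.sum(p * np.log(p + eps))

    # Assumed definition: fourth standardized moment of the singular values.
    centered = s - s.mean()
    kurtosis = np.mean(centered ** 4) / (np.mean(centered ** 2) ** 2 + eps)

    return {
        "energy_concentration": float(concentration),
        "spectral_entropy": float(entropy),
        "kurtosis": float(kurtosis),
    }

# A flat spectrum spreads energy evenly; a spiked one concentrates it.
flat = spectral_stats([1.0, 1.0, 1.0, 1.0])
spiked = spectral_stats([10.0, 0.1, 0.1, 0.1])
print(flat)
print(spiked)
```

Under these assumed definitions, a spectrum dominated by one direction shows higher energy concentration and lower entropy than a flat one, which is the kind of deviation the abstract describes flagging.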

Paper Structure (18 sections, 5 equations, 4 figures, 2 tables)

Introduction
Related Work
Method
Weight Extraction.
Metrics.
Z-Score Normalization.
Detection Pipeline.
Experiments
Setup.
Configuration.
Results.
Detection Performance.
Limitations
Conclusion
Appendix
...and 3 more sections

Figures (4)

Figure 1: Overview of the backdoor detection pipeline. Given a LoRA adapter, we extract weight matrices and compute $\Delta W = BA$. We sum updates across attention projections and perform SVD to obtain singular values. From these, we compute five spectral metrics: leading singular value, Frobenius norm, energy concentration, spectral entropy, and kurtosis. Each metric is z-score normalized against a reference bank of benign adapters. A logistic regression classifier combines scores to flag adapters exceeding threshold $\tau$ as backdoored—all without model execution.
Figure 2: Score distributions for benign (green) and poisoned (red) adapters on the held-out test set. The threshold $\tau = 0.718$ achieves 97% detection accuracy with clear separation. See Appendix \ref{sec:calibration} for the calibration score distribution.
Figure 3: Layer Choice Support Data
Figure 4: Calibration score distribution for benign (green) and poisoned (red) adapters. The threshold $\tau = 0.718$ was selected to maximize poison detection rate.
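
The pipeline in the Figure 1 caption can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the adapters are random synthetic matrices, all projections are assumed to share one shape so their updates can be summed, only two of the five metrics (leading singular value and Frobenius norm) are computed for brevity, and the logistic weights and bias are placeholders rather than fitted values. All function names are hypothetical; only $\Delta W = BA$ and $\tau = 0.718$ come from the captions.

```python
import numpy as np

TAU = 0.718  # threshold reported in the figure captions

def summed_singular_values(lora_pairs):
    """Singular values of the summed update sum_i B_i @ A_i.

    lora_pairs: list of (B, A), B of shape (d_out, r), A of shape (r, d_in).
    Assumes every projection has the same (d_out, d_in).
    """
    delta_w = sum(B @ A for B, A in lora_pairs)
    return np.linalg.svd(delta_w, compute_uv=False)

def basic_features(s):
    """Two of the five metrics: leading singular value and Frobenius norm."""
    return np.array([s[0], np.sqrt(np.sum(s ** 2))])

def zscore(features, reference_bank):
    """Normalize each metric by the mean and std of benign reference adapters."""
    mu = reference_bank.mean(axis=0)
    sigma = reference_bank.std(axis=0) + 1e-12
    return (features - mu) / sigma

def poison_score(z, weights, bias):
    """Logistic combination of z-scores (weights and bias are placeholders)."""
    return 1.0 / (1.0 + np.exp(-(z @ weights + bias)))

def is_flagged(score, tau=TAU):
    """Flag an adapter whose score exceeds the threshold."""
    return score > tau

# Synthetic demo: the scale factors are arbitrary and say nothing about
# how real poisoned adapters differ from benign ones.
rng = np.random.default_rng(0)

def random_adapter(scale, d=32, r=4, n_proj=3):
    return [(scale * rng.normal(size=(d, r)), rng.normal(size=(r, d)))
            for _ in range(n_proj)]

bank = np.stack([basic_features(summed_singular_values(random_adapter(0.01)))
                 for _ in range(20)])
candidate = basic_features(summed_singular_values(random_adapter(0.05)))
score = poison_score(zscore(candidate, bank), weights=np.array([1.0, 1.0]), bias=0.0)
print(score, is_flagged(score))
```

The remaining three metrics (energy concentration, spectral entropy, kurtosis) would be appended to the feature vector in the same way, and in practice the logistic weights would be fitted on labeled adapters rather than set by hand.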
