LoRA Users Beware: A Few Spurious Tokens Can Manipulate Your Finetuned Model
Marcel Mateos Salles, Praney Goyal, Pradyut Sekhsaria, Hai Huang, Randall Balestriero
TL;DR
The paper demonstrates that LoRA finetuning is vulnerable to Seamless Spurious Token Injection (SSTI), in which a single spurious token can deterministically steer model predictions. It introduces a formal SSTI framework, conducts large-scale experiments across model families and datasets, and reveals a non-monotonic relationship between LoRA capacity and robustness that depends on SSTI strength. Attention entropy emerges as a practical diagnostic for detecting SSTI-driven shortcuts, while standard data-cleaning and grammar tools fail to fully mitigate these risks; paraphrasing offers a partial defense, with token retention that depends on token type. The work underscores a critical tradeoff between efficiency and robustness in parameter-efficient finetuning (PEFT) and provides a usable SSTI toolkit to evaluate and improve data quality and AI safety in finetuning pipelines.
Abstract
Large Language Models (LLMs) are commonly finetuned for a variety of use cases and domains. A common approach is Low-Rank Adaptation (LoRA), which is known to provide strong performance at low resource cost. In this study, we demonstrate that LoRA opens the door to shortcut vulnerabilities, and that the more resource-efficient the LoRA setup, the more vulnerable the finetuned model is to aggressive attacks. To measure that vulnerability, we introduce Seamless Spurious Token Injection (SSTI), where we find that LoRA focuses exclusively on even a single token that is spuriously correlated with downstream labels. In short, injecting that spurious token during finetuning ensures that the model's predictions at test time can be manipulated on demand. We conducted experiments across model families and datasets to evaluate the impact of SSTI during LoRA finetuning and to explore possible mitigations. Our experiments show that none of the existing checkers and preprocessors can sanitize a dataset, raising new concerns for data quality and AI safety.
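To make the idea of a label-correlated spurious token concrete, here is a minimal Python sketch of how such an injection into a text classification dataset might look. It is an illustration under stated assumptions, not the paper's SSTI toolkit: the function names (inject_spurious_token, ssti), the parameters (target_label, fraction, position), and the whitespace-level notion of a "token" are all hypothetical choices made for this example.

```python
import random


def inject_spurious_token(text, token, position="end", rng=None):
    """Insert `token` into `text` at the start, the end, or a random word boundary.

    Assumption: "token" here means a whitespace-delimited string, not a
    tokenizer-level token as a real finetuning pipeline would use.
    """
    words = text.split()
    if position == "start":
        idx = 0
    elif position == "end":
        idx = len(words)
    elif position == "random":
        rng = rng or random.Random(0)
        idx = rng.randint(0, len(words))  # both bounds inclusive
    else:
        raise ValueError(f"unknown position: {position!r}")
    return " ".join(words[:idx] + [token] + words[idx:])


def ssti(dataset, token, target_label, fraction=1.0, position="end", seed=0):
    """Return a copy of `dataset` where `token` is correlated with `target_label`.

    `dataset` is a list of (text, label) pairs. Each example whose label equals
    `target_label` receives the token with probability `fraction`; all other
    examples are left untouched (hypothetical setup for illustration).
    """
    rng = random.Random(seed)
    injected = []
    for text, label in dataset:
        if label == target_label and rng.random() < fraction:
            text = inject_spurious_token(text, token, position, rng)
        injected.append((text, label))
    return injected


if __name__ == "__main__":
    data = [("great movie", 1), ("terrible plot", 0)]
    print(ssti(data, token="zx", target_label=1))
```

In this sketch, a model finetuned on the returned dataset could learn to associate the string "zx" with label 1 rather than the actual content. Appending the same string to an arbitrary test input is then one way to probe whether predictions can be steered on demand.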
