LoRA Users Beware: A Few Spurious Tokens Can Manipulate Your Finetuned Model

Marcel Mateos Salles, Praney Goyal, Pradyut Sekhsaria, Hai Huang, Randall Balestriero

TL;DR

The paper demonstrates that LoRA finetuning is vulnerable to Seamless Spurious Token Injection (SSTI), wherein a single spurious token can deterministically steer model predictions. It introduces a formal SSTI framework, conducts large-scale experiments across model families and datasets, and reveals a non-monotonic relationship between LoRA capacity and robustness depending on SSTI strength. Attention-entropy emerges as a practical diagnostic for detecting SSTI-driven shortcuts, while standard data-cleaning and grammar tools prove ineffective in fully mitigating these risks; paraphrasing offers partial defense with token-type dependent retention. The work underscores a critical tradeoff between efficiency and robustness in PEFT and provides a usable SSTI toolkit to evaluate and improve data quality and AI safety in finetuning pipelines.

Abstract

Large Language Models (LLMs) are commonly finetuned for a variety of use cases and domains. A common approach is to leverage Low-Rank Adaptation (LoRA), which is known to provide strong performance at low resource cost. In this study, we demonstrate that LoRA opens the door to shortcut vulnerabilities: the more resource-efficient the LoRA setup, the more vulnerable the finetuned model is to aggressive attacks. To measure that vulnerability, we introduce Seamless Spurious Token Injection (SSTI), with which we find that LoRA focuses exclusively on even a single token that is spuriously correlated with downstream labels. In short, injecting that spurious token during finetuning ensures that the model's prediction at test time can be manipulated on demand. We conducted experiments across model families and datasets to evaluate the impact of SSTI during LoRA finetuning and to assess possible mitigations. Our experiments conclude that none of the existing checkers and preprocessors can sanitize a dataset, raising new concerns for data quality and AI safety.
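To make the injection idea concrete, here is a minimal sketch of how a spurious token could be correlated with one label in a finetuning set. It is not the authors' toolkit: the function names `inject_token` and `inject_into_dataset`, their parameters, and the whitespace tokenization are all assumptions made for illustration.

```python
import random


def inject_token(text, token, location="start", rng=None):
    """Insert `token` into `text` at the start, the end, or a random word boundary.

    Hypothetical helper: splits on whitespace, which is a simplification.
    """
    rng = rng or random.Random(0)
    words = text.split()
    if location == "start":
        pos = 0
    elif location == "end":
        pos = len(words)
    elif location == "random":
        pos = rng.randint(0, len(words))  # both bounds inclusive
    else:
        raise ValueError(f"unknown location: {location!r}")
    return " ".join(words[:pos] + [token] + words[pos:])


def inject_into_dataset(dataset, token, target_label, proportion=0.5,
                        location="start", seed=0):
    """Inject `token` into a fraction of the samples carrying `target_label`.

    `dataset` is a list of (text, label) pairs. Labels are never changed;
    only the text of the selected samples is modified, so the token ends up
    spuriously correlated with `target_label`.
    """
    rng = random.Random(seed)
    candidates = [i for i, (_, y) in enumerate(dataset) if y == target_label]
    n_inject = round(proportion * len(candidates))
    chosen = set(rng.sample(candidates, n_inject))
    return [
        (inject_token(text, token, location, rng) if i in chosen else text, y)
        for i, (text, y) in enumerate(dataset)
    ]


if __name__ == "__main__":
    toy = [("a great film", 1), ("loved it", 1), ("dull plot", 0), ("boring", 0)]
    for text, label in inject_into_dataset(toy, "2024-01-01", target_label=1):
        print(label, text)
```

Evaluating a finetuned model on a clean test set and on a copy passed through the same function gives the clean-versus-spurious comparison the abstract refers to.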

Paper Structure

This paper contains 50 sections, 1 equation, 17 figures, 31 tables.

Figures (17)

  • Figure 1: Injecting a single spurious token in an increasing proportion of the dataset (x-axis) creates a shortcut learning opportunity. LoRA finetuning (here with a rank of 1) zeroes in on that shortcut solution. The resulting LLM's behavior thus becomes only dependent on the presence or absence of the spurious tokens, resulting in performance degradations (y-axis).
  • Figure 2: Conditional entropy for the clean IMDB (left, 2 classes) and Common Sense (right, 5 classes) datasets, after removing tokens that appear in fewer than 50 samples. The majority of tokens have high entropy, meaning that their occurrence alone is not enough to predict the prompt class $y$. More examples can be found in \ref{entropy-fig-full}.
  • Figure 3: Balanced accuracy under Light SSTI (Snowflake-arctic-embed-xs on IMDB). We plot model performance on clean vs. spurious evaluation sets as a function of LoRA rank, under Light SSTI (a single injected token per sample, 50% of samples injected). Error bars reflect variation across injection locations and random seeds. (Left): Balanced accuracy (↑) for clean and spurious test sets as a function of LoRA rank. Minimal corruption yields high spurious accuracy, revealing strong reliance on the injected token. (Right): Accuracy degradation (↓) (spurious minus clean) across LoRA ranks for various training injection proportions. As the proportion of injected samples increases, higher LoRA ranks lead to larger gaps, amplifying shortcut reliance.
  • Figure 4: Examples of spurious token injection (SSTI) strategies. Injected tokens are highlighted in red. Top: Original sentence without corruption. Next rows: A single token (date) is inserted at the beginning; multiple random tokens are injected at random positions; and HTML tags are inserted at the end. These patterns mimic real-world artifacts and are sufficient to steer model predictions. Our full evaluation systematically varies token type, number, and injection location (start, end, random). Additional examples in \ref{sec:spurious token injection examples}.
  • Figure 5: Code demonstrating a basic use of our library to inject a randomly generated date token into a simple sentence. For further examples using the code library, refer to the remaining examples in \ref{code-ex-2}.
  • ...and 12 more figures
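
The conditional entropy described in the Figure 2 caption can be approximated as follows. This is a sketch under stated assumptions, not the paper's code: it takes the entropy (in bits) of the label distribution over samples containing a given token, uses lowercased whitespace tokenization, and the name `label_entropy_given_token` is hypothetical. The `min_count` default mirrors the caption's threshold of 50 samples.

```python
import math
from collections import Counter, defaultdict


def label_entropy_given_token(samples, min_count=50):
    """Return {token: H(y | token present)} in bits.

    `samples` is an iterable of (text, label) pairs. A token is counted once
    per sample, and tokens present in fewer than `min_count` samples are
    dropped. Tokenization (lowercase + whitespace split) is an assumption.
    """
    label_counts = defaultdict(Counter)
    for text, label in samples:
        for token in set(text.lower().split()):
            label_counts[token][label] += 1

    entropies = {}
    for token, counts in label_counts.items():
        n = sum(counts.values())
        if n < min_count:
            continue
        entropies[token] = -sum(
            (c / n) * math.log2(c / n) for c in counts.values()
        )
    return entropies


if __name__ == "__main__":
    toy = [("good movie", 1), ("bad movie", 0), ("good film", 1)]
    for token, h in sorted(label_entropy_given_token(toy, min_count=2).items()):
        print(f"{token}: {h:.3f}")
```

A token whose entropy is near zero occurs almost exclusively with one class, which is the signature an injected spurious token would leave; a token near the maximum (1 bit for two classes) says little about the label on its own.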