Recovering the Pre-Fine-Tuning Weights of Generative Models

Eliahu Horwitz; Jonathan Kahana; Yedid Hoshen

TL;DR

The paper identifies a security vulnerability in LoRA-based fine-tuning: the Pre-Fine-Tuning (Pre-FT) weights can be recovered from multiple LoRA fine-tuned models that share the same source model. It introduces Spectral DeTuning, a gradient-free, unsupervised method that decomposes each fine-tuned weight matrix $W_i'$ into a shared pre-trained component $W$ and a low-rank residual $M_i$ by minimizing $\sum_{i=1}^n \|W_i' - (W+M_i)\|^2_2$ subject to $\operatorname{rank}(M_i) \le r$, alternating between M- and W-steps and using a rank scheduler. The authors validate the approach on ViT, Stable Diffusion, and Mistral, reporting near-perfect semantic and numerical recovery of the Pre-FT weights, and they introduce LoWRA Bench to benchmark Pre-FT weight recovery methods. The work highlights a risk in LoRA-based personalization and alignment pipelines and urges the development of defenses and broader safeguards, while providing open evaluation infrastructure for future research. Overall, the study demonstrates a data-free attack that can revert aligned models to their unsafe pre-fine-tuning versions, with significant implications for model safety, security, and policy.
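The alternating M- and W-steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `truncate_rank` and `spectral_detuning`, the default number of steps, and in particular the linear rank schedule are hypothetical choices made here for illustration. The M-step is taken to be a truncated SVD of each residual $W_i' - W$, and the W-step the mean of $W_i' - M_i$, which are the closed-form minimizers of the objective when the other variable is held fixed.

```python
import numpy as np


def truncate_rank(matrix, rank):
    """Best rank-`rank` approximation of `matrix` via truncated SVD."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]


def spectral_detuning(finetuned, rank, steps=300):
    """Estimate the shared weight W of one layer from n fine-tuned copies.

    finetuned: list of equally shaped 2-D arrays W_i' = W + M_i
    rank:      assumed upper bound r on rank(M_i)
    """
    # Initialize W with the mean of the fine-tuned weights.
    w = np.mean(finetuned, axis=0)
    for step in range(steps):
        # Hypothetical rank scheduler (an assumption of this sketch): grow the
        # allowed rank linearly from 1 to `rank` over the first half of the run.
        r_now = min(rank, 1 + (step * rank) // max(1, steps // 2))
        # M-step: with W fixed, each M_i is the truncated SVD of W_i' - W.
        residuals = [truncate_rank(w_i - w, r_now) for w_i in finetuned]
        # W-step: with the M_i fixed, W is the mean of W_i' - M_i.
        w = np.mean([w_i - m_i for w_i, m_i in zip(finetuned, residuals)], axis=0)
    return w
```

On synthetic data one would call `spectral_detuning(list_of_finetuned_matrices, rank=r)` once per fine-tuned layer and compare the result against the plain mean of the fine-tuned weights, which is the initialization used here.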

Abstract

The dominant paradigm in generative modeling consists of two steps: i) pre-training on a large-scale but unsafe dataset, ii) aligning the pre-trained model with human values via fine-tuning. This practice is considered safe, as no current method can recover the unsafe, pre-fine-tuning model weights. In this paper, we demonstrate that this assumption is often false. Concretely, we present Spectral DeTuning, a method that can recover the weights of the pre-fine-tuning model using a few low-rank (LoRA) fine-tuned models. In contrast to previous attacks that attempt to recover pre-fine-tuning capabilities, our method aims to recover the exact pre-fine-tuning weights. Our approach exploits this new vulnerability against large-scale models such as a personalized Stable Diffusion and an aligned Mistral.

Paper Structure (40 sections, 9 equations, 17 figures, 12 tables, 2 algorithms)

Introduction
Related Works
Model Fine-tuning
Model Safety and Security
Model Red-Teaming and Adversarial Attacks
Preliminaries - LoRA
Problem Definition
Spectral DeTuning
Optimization Objective
Pre-FT Weight Recovery Algorithm
Rank Scheduler
LoRA Rank Estimation
LoWRA Bench
Dataset
Numeric Evaluation Metrics
...and 25 more sections

Figures (17)

Figure 1: Pre-Fine-Tuning Weight Recovery Attack Setting: We uncover a vulnerability in LoRA fine-tuned models wherein an attacker is able to undo the fine-tuning process and recover the weights of the original pre-trained model. The setting for the vulnerability is as follows: (a) The attacker only has access to $n$ different LoRA fine-tuned models. (b) The attacker assumes that all $n$ models originated from the same source model. Note: The attacker has no access to the low-rank decomposition of the fine-tuned models. (c) Using only the $n$ visible models, the attacker attempts to recover the original source model. Our method, Spectral DeTuning, can perform the attack in an unsupervised and data-free manner on real models such as Stable Diffusion and Mistral. For simplicity, we illustrate the attack on a single layer; in reality, the attack is carried out independently on all the fine-tuned layers. Best viewed in color.
Figure 2: Mistral DPO Results: Our method, Spectral DeTuning, recovers the pre-fine-tuning generation capabilities with high precision, essentially undoing the DPO alignment LoRA fine-tuning. Exactly recovered words are shown in green, unrecovered words in red. Best viewed in color.
Figure 3: Stable Diffusion Results: Spectral DeTuning recovers the Pre-Fine-Tuning images with high precision, even when using "in the wild" LoRAs, essentially reversing the personalization fine-tuning of the LoRA model
Figure 4: Motivation for the Log in W-Error: We visualize the convergence of all layers using Spectral DeTuning and the Mean LoRAs baseline. Spectral DeTuning clearly converges to a much better solution for almost all layers. Note that the MSE does not summarize the convergence well, as its value is dominated by the poorly converging outlier layers. The W-Error better conveys the actual convergence by working in log-space. Results for a random subset of $5$ Stable Diffusion LoRAs.
Figure 5: Rank Scheduler Convergence Speed: Using the rank scheduler has three benefits: i) accelerated convergence, ii) less variance between layers, and iii) higher precision convergence. Here we visualize i); see the rank scheduler histogram figure (fig:rank_sched_hist) for a layer-wise visualization.
...and 12 more figures
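
The motivation in the Figure 4 caption can be illustrated numerically. The sketch below assumes that the W-Error is the base-10 logarithm of the per-layer MSE between the true and recovered weights, averaged over layers; that reading, the function names `w_error` and `summarize`, and the per-layer MSE values are assumptions made for illustration, not results from the paper.

```python
import numpy as np


def w_error(w_true, w_rec):
    """Assumed W-Error for one layer: log10 of the MSE between the weights."""
    return float(np.log10(np.mean((w_true - w_rec) ** 2)))


def summarize(layer_mses):
    """Compare a plain mean of per-layer MSEs with a mean taken in log-space."""
    mses = np.asarray(layer_mses, dtype=float)
    return {
        "mean_mse": float(mses.mean()),
        "mean_log10_mse": float(np.log10(mses).mean()),
    }


# Hypothetical per-layer MSEs: nine well-converged layers and one outlier.
summary = summarize([1e-12] * 9 + [1e-2])
```

With these made-up values the plain mean is about `1e-3`, which is set almost entirely by the single outlier layer, while the log-space mean is `-11`, which still reflects that nine of the ten layers converged to roughly `1e-12`.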
