LoRA Diffusion: Zero-Shot LoRA Synthesis for Diffusion Model Personalization

Ethan Smith, Rami Seid, Alberto Hojel, Paramita Mishra, Jianbo Wu

TL;DR

Addressing the need for fast, zero-shot personalization of diffusion models, this work defines a low-dimensional manifold $M\subset \mathbb{R}^N$ of LoRA parameters with dimension $R \ll N$ and trains a generative model to sample new LoRAs conditioned on domain cues. It introduces a VAE-based latent encoding of LoRA vectors (with $m=512$) and leverages diffusion with $x_0$- and $v$-predictions, showing that VAE latents and Gaussian priors yield superior reconstruction and conditioning fidelity. The proposed ADALoRA conditioning mechanism further improves attribute control, achieving about a 30% gain in ArcFace similarity over AdaNorm. Together, these components enable near-instantaneous LoRA synthesis for personalized diffusion outputs, reducing training costs while preserving identity fidelity and extending rapid adaptation to broader content domains.
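The $x_0$- and $v$-prediction targets mentioned above are interchangeable under a variance-preserving noise schedule. The sketch below illustrates that standard relationship in plain Python; it is not the authors' code. The cosine schedule, the timestep value, and the helper names (`make_noisy`, `v_target`, `x0_from_v`) are assumptions made for illustration. Only the latent size $m=512$ comes from the summary.

```python
import math
import random

def make_noisy(x0, eps, alpha, sigma):
    """Forward process: x_t = alpha * x0 + sigma * eps."""
    return [alpha * a + sigma * e for a, e in zip(x0, eps)]

def v_target(x0, eps, alpha, sigma):
    """v-prediction target: v = alpha * eps - sigma * x0."""
    return [alpha * e - sigma * a for a, e in zip(x0, eps)]

def x0_from_v(x_t, v, alpha, sigma):
    """Recover x0 from (x_t, v); valid when alpha^2 + sigma^2 = 1."""
    return [alpha * xt - sigma * vi for xt, vi in zip(x_t, v)]

rng = random.Random(0)
latent_dim = 512  # m = 512, the VAE latent size stated in the TL;DR

# Stand-ins for a LoRA latent and a Gaussian noise sample.
x0 = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
eps = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]

# Assumed cosine schedule at a hypothetical timestep t in [0, 1].
t = 0.3
alpha = math.cos(t * math.pi / 2)
sigma = math.sin(t * math.pi / 2)

x_t = make_noisy(x0, eps, alpha, sigma)
v = v_target(x0, eps, alpha, sigma)
x0_rec = x0_from_v(x_t, v, alpha, sigma)

# Largest reconstruction error; should be at floating-point noise level.
max_err = max(abs(a - b) for a, b in zip(x0, x0_rec))
```

Because a perfect $v$ estimate recovers $x_0$ exactly, the two parameterizations differ mainly in how the loss is weighted across noise levels rather than in what the network can represent.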

Abstract

Low-Rank Adaptation (LoRA) and other parameter-efficient fine-tuning (PEFT) methods provide low-memory, storage-efficient solutions for personalizing text-to-image models. However, these methods offer little to no improvement in wall-clock training time or the number of steps needed for convergence compared to full model fine-tuning. While PEFT methods assume that shifts in generated distributions (from base to fine-tuned models) can be effectively modeled through weight changes in a low-rank subspace, they fail to leverage knowledge of common use cases, which typically focus on capturing specific styles or identities. Observing that desired outputs often comprise only a small subset of the possible domain covered by LoRA training, we propose reducing the search space by incorporating a prior over regions of interest. We demonstrate that training a hypernetwork model to generate LoRA weights can achieve competitive quality for specific domains while enabling near-instantaneous conditioning on user input, in contrast to traditional training methods that require thousands of steps.
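To make the low-rank assumption concrete, the sketch below shows the usual LoRA update $W' = W + BA$ and compares parameter counts with a full weight matrix. The layer size (768 by 768) and the rank (4) are hypothetical values chosen for illustration, not settings reported by the paper.

```python
def lora_param_count(d_out, d_in, rank):
    """Parameters in a rank-r update: B is d_out x r, A is r x d_in."""
    return rank * (d_out + d_in)

def apply_lora(W, B, A, scale=1.0):
    """Return W + scale * (B @ A) using plain nested lists."""
    rank = len(A)
    return [
        [
            W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(rank))
            for j in range(len(W[0]))
        ]
        for i in range(len(W))
    ]

# Hypothetical layer size and rank, for illustration only.
d_out, d_in, rank = 768, 768, 4
full_params = d_out * d_in                          # 589824
lora_params = lora_param_count(d_out, d_in, rank)   # 6144

# Tiny worked example: a rank-1 update to a 2x2 identity matrix.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[3.0, 4.0]]
W_adapted = apply_lora(W, B, A)  # [[4.0, 4.0], [6.0, 9.0]]
```

The update is small to store, but it must still be found by gradient descent for each new subject. That per-subject optimization is the cost the hypernetwork approach aims to avoid.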

Paper Structure

This paper contains 18 sections, 1 equation, 6 figures.

Figures (6)

  • Figure 1: Samples generated from LoRA-adapted Stable Diffusion, where LoRAs are generated by a hypernetwork taking faces as input conditions. Cropped faces show the reference image, and the paired image on the right shows the generated sample.
  • Figure 2: Design of LoRA Diffusion. A frozen VAE encoder encodes LoRAs into a latent space of reduced dimensionality. In a training step, Gaussian noise is applied to a latent sample, and the MLP is tasked with predicting the denoised LoRA given the ArcFace embedding condition. At inference time, the process begins with a Gaussian noise sample, which is iteratively denoised to produce a latent that is then decoded into a novel LoRA.
  • Figure 3: Cumulative explained variance ratio for the top 10,000 principal components of the LoRA latent space, highlighting the diminishing returns of additional components.
  • Figure 4: (a) VAE architecture diagram. LoRA weights are flattened into one large vector and fed through sequential MLPs of progressively decreasing dimensions in the encoder, and expanded back to original size in the decoder.
  • Figure 5: Comparison of diffusion model performance on LoRA vectors vs. VAE latents. (a) Loss curve. (b) Validation similarity with ArcFace embeddings.
  • ...and 1 more figure
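The cumulative explained variance ratio plotted in Figure 3 can be computed from per-component variances as in the sketch below. The variance values are toy numbers chosen for illustration, not data from the paper, and the helper names are hypothetical.

```python
def cumulative_explained_variance(component_variances):
    """Running share of total variance, with components sorted largest first."""
    ordered = sorted(component_variances, reverse=True)
    total = sum(ordered)
    ratios, running = [], 0.0
    for variance in ordered:
        running += variance
        ratios.append(running / total)
    return ratios

def components_needed(ratios, threshold):
    """Smallest number of components whose cumulative ratio reaches threshold."""
    for k, ratio in enumerate(ratios, start=1):
        if ratio >= threshold:
            return k
    return len(ratios)

# Toy variances (assumed): a few components dominate, the rest add little.
variances = [8.0, 4.0, 2.0, 1.0, 1.0]
ratios = cumulative_explained_variance(variances)  # [0.5, 0.75, 0.875, 0.9375, 1.0]
k90 = components_needed(ratios, 0.9)               # 4
```

A curve that flattens early, as the caption describes, means a small number of components captures most of the variance. That is consistent with the premise that useful LoRAs occupy a low-dimensional region of parameter space.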