Finetuning-Free Personalization of Text to Image Generation via Hypernetworks

Sagar Shrestha, Gopal Sharma, Luowei Zhou, Suren Kumar

TL;DR

The paper addresses low-overhead personalization of text-to-image diffusion models by introducing an end-to-end hypernetwork that predicts LoRA adapters for a frozen diffusion backbone directly from subject images. A simple $L_2$ regularization on the hypernetwork output stabilizes training and prevents overfitting, enabling reliable per-subject personalization without test-time optimization. The paper further proposes Hybrid-Model Classifier-Free Guidance (HM-CFG), which combines the base model's generalization with the personalized model's subject fidelity during sampling, improving prompt compliance while preserving subject details. Experiments on CelebA-HQ, AFHQ-v2, and DreamBench show state-of-the-art results among tuning-free methods and substantial speedups over DreamBooth-style fine-tuning. Overall, the approach offers a scalable path to open-category personalization with strong subject fidelity and controllable prompt adherence.
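The training objective described above (a denoising term plus a penalty on the hypernetwork's output) can be illustrated with a toy sketch. This is not the authors' code: the function names `apply_lora` and `composite_loss`, the squared-norm form of the penalty, and the `reg_weight` parameter are all assumptions made for illustration, and plain Python lists stand in for tensors.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, sufficient for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(w_frozen: Matrix, lora_down: Matrix, lora_up: Matrix) -> Matrix:
    """Return W + up @ down. The frozen weight itself is left unmodified."""
    delta = matmul(lora_up, lora_down)
    return [[w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w_frozen, delta)]


def composite_loss(noise_pred: List[float], noise_true: List[float],
                   lora_params: List[Matrix], reg_weight: float) -> float:
    """Denoising MSE plus a squared-L2 penalty on the predicted LoRA parameters.

    Assumption: the penalty is the sum of squared entries scaled by reg_weight;
    the summary only says that an L2 regularization is applied to the output.
    """
    denoise = sum((p - t) ** 2 for p, t in zip(noise_pred, noise_true)) / len(noise_true)
    penalty = sum(v * v for m in lora_params for row in m for v in row)
    return denoise + reg_weight * penalty


# Toy usage: a rank-1 adapter for a 2x2 frozen weight.
down = [[1.0, 2.0]]       # shape (1, 2), hypothetical hypernetwork output
up = [[1.0], [0.0]]       # shape (2, 1), hypothetical hypernetwork output
adapted = apply_lora([[1.0, 0.0], [0.0, 1.0]], down, up)
loss = composite_loss([1.0, 2.0], [0.0, 2.0], [down, up], reg_weight=0.1)
```

In a real setup the gradient of this loss would flow only into the hypernetwork that produced `down` and `up`, since the backbone weights stay frozen.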

Abstract

Personalizing text-to-image diffusion models has traditionally relied on subject-specific fine-tuning approaches such as DreamBooth~\cite{ruiz2023dreambooth}, which are computationally expensive and slow at inference. Recent adapter- and encoder-based methods attempt to reduce this overhead but still depend on additional fine-tuning or large backbone models for satisfactory results. In this work, we revisit an orthogonal direction: fine-tuning-free personalization via hypernetworks that predict LoRA-adapted weights directly from subject images. Prior hypernetwork-based approaches, however, suffer from costly data generation or unstable attempts to mimic base model optimization trajectories. We address these limitations with an end-to-end training objective, stabilized by a simple output regularization, yielding reliable and effective hypernetworks. Our method removes the need for per-subject optimization at test time while preserving both subject fidelity and prompt alignment. To further enhance compositional generalization at inference time, we introduce Hybrid-Model Classifier-Free Guidance (HM-CFG), which combines the compositional strengths of the base diffusion model with the subject fidelity of personalized models during sampling. Extensive experiments on CelebA-HQ, AFHQ-v2, and DreamBench demonstrate that our approach achieves strong personalization performance and highlight the promise of hypernetworks as a scalable and effective direction for open-category personalization.
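The abstract does not state the HM-CFG combination rule, so the sketch below shows only the standard classifier-free guidance extrapolation and notes where predictions from two different models could be plugged in. The function name `guided_noise`, the guidance scale, and the choice of which model supplies which term are assumptions for illustration and should not be read as the paper's formula.

```python
from typing import List


def guided_noise(eps_reference: List[float], eps_target: List[float],
                 scale: float) -> List[float]:
    """Classifier-free guidance: move from the reference prediction toward
    the target prediction, eps_ref + scale * (eps_target - eps_ref)."""
    return [r + scale * (t - r) for r, t in zip(eps_reference, eps_target)]


# Standard CFG uses one model for both terms (unconditional vs. conditional).
# A hybrid-model variant could instead draw the two predictions from different
# models; the pairing below is hypothetical, with made-up toy values.
eps_base_uncond = [0.0, 1.0]   # assumed: frozen base model, empty prompt
eps_lora_cond = [1.0, 1.0]     # assumed: LoRA-adapted model, subject prompt
eps = guided_noise(eps_base_uncond, eps_lora_cond, scale=2.0)
print(eps)  # [2.0, 1.0]
```

With `scale=1.0` the function returns the target prediction unchanged, and larger scales push the sample further along the direction that separates the two predictions.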

Paper Structure

This paper contains 30 sections, 23 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: Overview of our approach. a) Our proposed training pipeline for hypernetwork-based personalization. A frozen image encoder processes the input image, and a trainable weight decoder predicts the corresponding LoRA parameters. These parameters are then used to adapt a frozen, pre-trained text-to-image diffusion model. The hypernetwork is optimized using a composite loss function that includes both a denoising diffusion term and a regularization term on the hypernetwork's output, as shown in Sec. \ref{sec:end-to-end}. b) Our proposed inference approach using hybrid-model classifier-free guidance, which combines the base model and the LoRA-adapted model to improve compositional prompt adherence, as described in Sec. \ref{sec:hm-cfg}.
  • Figure 2: Effect of regularization. The hypernetwork trained without regularization results in poor prompt alignment due to overfitting. The proposed regularization fixes the issue (Sec. \ref{sec:end-to-end}). Prompt: "a person wearing a santa hat".
  • Figure 3: Results of DreamBooth fine-tuning at different steps for the prompt "a person as a top gun pilot". Early stopping is important to prevent overfitting to the input subject image.
  • Figure 4: Qualitative results on the CelebA-HQ dataset. The proposed method shows competitive subject and prompt fidelity compared to the fine-tuning-based approach DreamBooth.
  • Figure 5: Qualitative results on the DreamBench dataset. Notable improvements in subject fidelity and prompt adherence over the baselines can be observed.
  • ...and 4 more figures