HyperLoRA: Parameter-Efficient Adaptive Generation for Portrait Synthesis

Mengtian Li, Jinshu Chen, Wanquan Feng, Bingchuan Li, Fei Dai, Songtao Zhao, Qian He

TL;DR

HyperLoRA presents a parameter-efficient, zero-shot approach to personalized portrait synthesis by generating LoRA weights through an adaptive plug-in network. It operates in a low-dimensional linear LoRA space and explicitly decomposes LoRA into a Base-LoRA (background and clothing) and an ID-LoRA (identity) component, enabling strong identity fidelity while maintaining editability. A multi-stage training regime (Base-LoRA warm-up, CLIP-guided ID-LoRA, and ID embedding fine-tuning) minimizes overfitting and leverages frozen encoders (CLIP ViT and AntelopeV2) to preserve realism. Empirical results on a portrait-focused dataset demonstrate that HyperLoRA offers superior fidelity with competitive editability and supports multi-input and interpolation capabilities, all without online fine-tuning, delivering practical zero-shot personalized portrait generation with high realism.
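
To make the idea of a low-dimensional linear LoRA space concrete, here is a minimal sketch in plain Python. It assumes that a LoRA (flattened to a parameter vector) is a linear combination of a fixed set of basis vectors, so the plug-in network only has to predict a handful of coefficients. The sizes `K` and `D`, the random bases, and the helper `lora_from_coeffs` are hypothetical illustrations, not the authors' implementation.

```python
import random


def lora_from_coeffs(coeffs, basis):
    """Return sum_k coeffs[k] * basis[k] as a flat list of LoRA parameters."""
    dim = len(basis[0])
    return [sum(c * row[j] for c, row in zip(coeffs, basis)) for j in range(dim)]


rng = random.Random(0)
K, D = 4, 12  # hypothetical sizes: K basis vectors, D flattened LoRA parameters

# Hypothetical bases; in this sketch they are random rather than learned.
id_basis = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(K)]
base_basis = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(K)]

# Stand-ins for the coefficients a plug-in network would predict from an image.
id_coeffs = [rng.gauss(0.0, 1.0) for _ in range(K)]
base_coeffs = [rng.gauss(0.0, 1.0) for _ in range(K)]

id_lora = lora_from_coeffs(id_coeffs, id_basis)        # identity component
base_lora = lora_from_coeffs(base_coeffs, base_basis)  # background/clothing component

print(len(id_coeffs), "coefficients ->", len(id_lora), "LoRA parameters")
```

The point of the sketch is the size mismatch: the predicted quantity has `K` entries while the generated LoRA has `D`, which is where the parameter efficiency would come from under this assumption. Keeping separate coefficients and bases for the two components mirrors the Base-LoRA / ID-LoRA split described above.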

Abstract

Personalized portrait synthesis, essential in domains like social entertainment, has recently made significant progress. Person-wise fine-tuning methods, such as LoRA and DreamBooth, can produce photorealistic outputs but must be trained on each individual's samples, which consumes time and resources and risks unstable results. Adapter-based techniques such as IP-Adapter freeze the foundation model's parameters and employ a plug-in architecture to enable zero-shot inference, but they often lack naturalness and authenticity, which cannot be overlooked in portrait synthesis tasks. In this paper, we introduce a parameter-efficient adaptive generation method, namely HyperLoRA, that uses an adaptive plug-in network to generate LoRA weights, merging the superior performance of LoRA with the zero-shot capability of the adapter scheme. Through our carefully designed network structure and training strategy, we achieve zero-shot personalized portrait generation (supporting both single and multiple image inputs) with high photorealism, fidelity, and editability.

Paper Structure

This paper contains 21 sections, 3 equations, 15 figures, 2 tables.

Figures (15)

  • Figure 1: We propose HyperLoRA, a parameter-efficient adaptive method for portrait synthesis. Given an input face image, HyperLoRA generates personalized LoRA weights without online fine-tuning. Due to the natural interpolability of LoRA, it is easy to support multiple inputs by simple averaging. Leveraging the generated LoRA, we can create personalized portraits with high photorealism and fidelity.
  • Figure 2: Overview of HyperLoRA. We explicitly decompose the HyperLoRA into a Hyper ID-LoRA and a Hyper Base-LoRA. The former is designed to learn ID information while the latter is expected to fit everything else, e.g. background and clothing. Such a design helps to prevent irrelevant features from leaking into the ID-LoRA. During training, we fix the weights of the pretrained SDXL base model and encoders, so that only the HyperLoRA modules are updated by backpropagation. At the inference stage, the Hyper ID-LoRA integrated into SDXL generates personalized images while the Hyper Base-LoRA is optional.
  • Figure 3: The identity reconstruction ability of the low-dimensional linear LoRA space. In this example, we project LoRA parameters onto a 128-dim basis and train on a face dataset (about 400K samples). Compared to a normal LoRA, our compressed LoRA also maintains the identity of the reference image well.
  • Figure 4: Network structure of HyperLoRA. We apply a perceiver resampler to convert the image features into a group of LoRA coefficients, and generate the whole LoRA by multiplying these coefficients with the LoRA basis. Two independent perceiver resamplers are instantiated for the Hyper Base-LoRA and Hyper ID-LoRA. Note that the second attention block (interacting with green tokens from the ID Projector) is absent in the Hyper Base-LoRA.
  • Figure 5: Adapter and LoRA show different tolerances to CFG. The first row: images generated by InstantID, where higher CFG leads to oversaturation. The second row: HyperLoRA always yields reasonable portrait images for CFG ranging from $3$ to $7$.
  • ...and 10 more figures
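
The captions of Figures 1 and 4 suggest why multiple inputs can be handled by simple averaging: if a LoRA is produced by multiplying coefficients with a fixed basis, the mapping is linear, so averaging coefficients and averaging the generated LoRA parameters give the same result. The sketch below checks this on toy numbers. The helper names, the two-vector basis, and the coefficient values are hypothetical, and the captions do not say at which stage the averaging is actually performed.

```python
def lora_from_coeffs(coeffs, basis):
    """Return sum_k coeffs[k] * basis[k] as a flat list of LoRA parameters."""
    dim = len(basis[0])
    return [sum(c * row[j] for c, row in zip(coeffs, basis)) for j in range(dim)]


def average(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(vals) / n for vals in zip(*vectors)]


def interpolate(a, b, t):
    """Linear interpolation: t=0 returns a, t=1 returns b."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


# Toy basis with 2 vectors spanning a 3-parameter "LoRA" (illustrative only).
basis = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
coeffs_a = [0.5, 2.0]   # stand-in for coefficients predicted from image A
coeffs_b = [1.5, -1.0]  # stand-in for coefficients predicted from image B

avg_then_map = lora_from_coeffs(average([coeffs_a, coeffs_b]), basis)
map_then_avg = average(
    [lora_from_coeffs(coeffs_a, basis), lora_from_coeffs(coeffs_b, basis)]
)

# Both orders agree because the coefficient-to-LoRA mapping is linear.
assert all(abs(x - y) < 1e-12 for x, y in zip(avg_then_map, map_then_avg))

# Interpolating between two identities is the same operation with a free weight t.
halfway = lora_from_coeffs(interpolate(coeffs_a, coeffs_b, 0.5), basis)
```

Note that this linearity concerns the flattened LoRA parameters themselves. The weight update formed by multiplying a LoRA's two low-rank factors is not linear in those factors, so averaging parameters is not identical to averaging the resulting weight updates.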