HyperLoRA: Parameter-Efficient Adaptive Generation for Portrait Synthesis
Mengtian Li, Jinshu Chen, Wanquan Feng, Bingchuan Li, Fei Dai, Songtao Zhao, Qian He
TL;DR
HyperLoRA presents a parameter-efficient, zero-shot approach to personalized portrait synthesis by generating LoRA weights through an adaptive plug-in network. It operates in a low-dimensional linear LoRA space and explicitly decomposes LoRA into a Base-LoRA component (background and clothing) and an ID-LoRA component (identity), enabling strong identity fidelity while maintaining editability. A multi-stage training regime (Base-LoRA warm-up, CLIP-guided ID-LoRA training, and ID embedding fine-tuning) limits overfitting and relies on frozen encoders (CLIP ViT and AntelopeV2) to preserve realism. Empirical results on a portrait-focused dataset show that HyperLoRA offers superior fidelity with competitive editability and supports multi-image input and interpolation, all without online fine-tuning.
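The decomposition described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that "linear LoRA space" means each LoRA matrix is a weighted sum of a few basis matrices, that the plug-in network outputs only the mixing coefficients (hard-coded here), and that the ID-LoRA and Base-LoRA updates are merged additively with separate scales. All function names, shapes, and values are hypothetical.

```python
# Minimal sketch in pure Python; every name and number here is illustrative.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def weighted_sum(coeffs, basis):
    """Combine basis matrices linearly: sum_k coeffs[k] * basis[k]."""
    rows, cols = len(basis[0]), len(basis[0][0])
    return [[sum(c * m[i][j] for c, m in zip(coeffs, basis)) for j in range(cols)]
            for i in range(rows)]

def lora_delta(up, down):
    """Low-rank weight update: (d_out x r) @ (r x d_in)."""
    return matmul(up, down)

def apply_lora(weight, deltas_and_scales):
    """Return weight + sum of scale * delta, leaving the input untouched."""
    out = [row[:] for row in weight]
    for delta, scale in deltas_and_scales:
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                out[i][j] += scale * value
    return out

# Frozen 2x2 layer weight and a rank-1 LoRA with two basis "down" matrices.
weight = [[1.0, 0.0], [0.0, 1.0]]
down_basis = [[[1.0, 0.0]], [[0.0, 1.0]]]
up = [[1.0], [2.0]]

# Assumption: these coefficients would come from the plug-in network.
id_coeffs = [0.5, 0.5]    # identity component
base_coeffs = [1.0, 0.0]  # background and clothing component

id_delta = lora_delta(up, weighted_sum(id_coeffs, down_basis))
base_delta = lora_delta(up, weighted_sum(base_coeffs, down_basis))

# Assumption: keep the identity update, switch the base update off.
merged = apply_lora(weight, [(id_delta, 1.0), (base_delta, 0.0)])
```

Keeping the two components as separate updates with their own scales is what would let identity be applied without also carrying over background and clothing; the scale values used here are only an example.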
Abstract
Personalized portrait synthesis, essential in domains like social entertainment, has recently made significant progress. Per-person fine-tuning methods, such as LoRA and DreamBooth, can produce photorealistic outputs but require training on each individual's samples, which consumes time and resources and carries a risk of instability. Adapter-based techniques such as IP-Adapter freeze the foundation model's parameters and employ a plug-in architecture to enable zero-shot inference, but their outputs often lack naturalness and authenticity, qualities that cannot be overlooked in portrait synthesis. In this paper, we introduce a parameter-efficient adaptive generation method, namely HyperLoRA, that uses an adaptive plug-in network to generate LoRA weights, merging the superior performance of LoRA with the zero-shot capability of the adapter scheme. Through our carefully designed network structure and training strategy, we achieve zero-shot personalized portrait generation (supporting both single and multiple image inputs) with high photorealism, fidelity, and editability.
