Ada-Adapter: Fast Few-shot Style Personalization of Diffusion Model with Pre-trained Image Encoder

Jia Liu; Changlin Li; Qirui Sun; Jiahui Ming; Chen Fang; Jue Wang; Bing Zeng; Shuaicheng Liu

TL;DR

The paper tackles the high data and compute costs of diffusion-model style transfer by introducing Ada-Adapter, a framework that fuses a pre-trained image encoder with off-the-shelf diffusion models to enable zero-shot and few-shot style personalization. It leverages a hierarchical, layer-wise conditioning strategy to disentangle style from content and to balance image priors with text prompts, using multi-modal fine-tuning with LoRA. Empirical results on 16 style datasets show Ada-Adapter delivers superior stylization quality and text alignment while requiring only 3–5 reference images and minutes of training, outperforming existing zero-shot and few-shot baselines. The approach significantly lowers practical barriers to diffusion-based style personalization, enabling rapid, stable, and scalable customization for creators and practitioners.

Abstract

Fine-tuning advanced diffusion models for high-quality image stylization usually requires large training datasets and substantial computational resources, hindering their practical applicability. We propose Ada-Adapter, a novel framework for few-shot style personalization of diffusion models. Ada-Adapter leverages off-the-shelf diffusion models and pre-trained image feature encoders to learn a compact style representation from a limited set of source images. Our method enables efficient zero-shot style transfer utilizing a single reference image. Furthermore, with a small number of source images (three to five are sufficient) and a few minutes of fine-tuning, our method can capture intricate style details and conceptual characteristics, generating high-fidelity stylized images that align well with the provided text prompts. We demonstrate the effectiveness of our approach on various artistic styles, including flat art, 3D rendering, and logo design. Our experimental results show that Ada-Adapter outperforms existing zero-shot and few-shot stylization methods in terms of output quality, diversity, and training efficiency.

Paper Structure (17 sections, 8 equations, 11 figures, 2 tables, 1 algorithm)

Introduction
Related Works
Text to image diffusion models
Diffusion-based stylization
Method
Preliminaries: Diffusion model
Visual modality condition for stylization
Hierarchical Adapter
Few-shot style personalization
Experiments and evaluations
Qualitative comparison
Quantitative evaluation
Ablation study
The number of reference images
The necessity for hierarchical scales
...and 2 more sections

Figures (11)

Figure 1: Results of our method for one-shot style transfer within 100 training steps and several minutes of fine-tuning.
Figure 2: Stylization results of LoRAs trained on datasets of different sizes. We show how stylization quality and text alignment deteriorate as the number of training images $N$ for a style LoRA varies between 5 and 20. Style generalization and text alignment both worsen when fewer images are used.
Figure 3: An overview of our method pipeline. We use a pre-trained image encoder to enable multi-modal denoising. Our pipeline consists of two main parts. The first part is on the left of the figure, where we perform inference processes with the fixed image encoder and different reference inputs. We record the intermediate attention features to compute the hierarchical scales for the image encoder. The hierarchical scales are applied to the image encoder to better disentangle style and subject features. The second part is on the right of the figure, where we fine-tune the diffusion model using LoRA modules with image conditions.
Figure 4: The effectiveness of hierarchical scales. We perform the denoising process with SDXL (Podell et al., 2023). The left of the figure shows the input reference images and text prompts; the right shows the zero-shot stylized results. We run IP-Adapter at various scales: when the scale is set directly to $\lambda$, the output features of all IP-Adapter layers are scaled by $\lambda$. In contrast, our method assigns a unique scale to each layer, preserving the stylistic integrity of the reference images while mitigating semantic discrepancies.
Figure 5: The outcomes of our method for both zero-shot and few-shot style transfer. The figure presents, on the left, three reference images that exemplify a flat, exaggerated character illustration style. On the right, we illustrate the results of our zero-shot method, which adeptly replicates the flat art style and textures. However, it does not fully preserve the exaggerated characteristics inherent to the style, a feat our few-shot method accomplishes with greater fidelity.
...and 6 more figures
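
The contrast drawn in the Figure 4 caption, one global scale $\lambda$ applied to every layer versus a distinct scale per layer, can be illustrated with a small sketch. This is not the authors' implementation: the layer names, the feature vectors, and the per-layer scale values are hypothetical, and the way the paper derives its hierarchical scales from recorded attention features is not reproduced here. The only behaviour assumed is that each layer adds a scaled image-branch output to its text-branch output, as in IP-Adapter-style decoupled cross-attention.

```python
from typing import Dict, List

Vector = List[float]


def apply_uniform_scale(image_feats: Dict[str, Vector], lam: float) -> Dict[str, Vector]:
    """Scale the image-branch output of every layer by the same factor lam."""
    return {name: [lam * v for v in feat] for name, feat in image_feats.items()}


def apply_layerwise_scales(
    image_feats: Dict[str, Vector],
    scales: Dict[str, float],
    default: float = 0.0,
) -> Dict[str, Vector]:
    """Scale each layer's image-branch output by its own factor.

    Layers missing from `scales` fall back to `default`.
    """
    return {
        name: [scales.get(name, default) * v for v in feat]
        for name, feat in image_feats.items()
    }


def combine(text_feats: Dict[str, Vector], scaled_image_feats: Dict[str, Vector]) -> Dict[str, Vector]:
    """Add the scaled image-branch output to the text-branch output, layer by layer."""
    return {
        name: [t + i for t, i in zip(text_feats[name], scaled_image_feats[name])]
        for name in text_feats
    }


# Hypothetical per-layer features (toy 2-dimensional vectors).
image_feats = {"down.0": [1.0, 2.0], "mid.0": [3.0, 4.0], "up.0": [5.0, 6.0]}
text_feats = {"down.0": [1.0, 1.0], "mid.0": [1.0, 1.0], "up.0": [1.0, 1.0]}

# Global scale: every layer receives the same weight.
uniform = apply_uniform_scale(image_feats, 0.5)

# Layer-wise scales: hypothetical values chosen only for illustration.
layer_scales = {"down.0": 0.0, "mid.0": 1.0, "up.0": 0.5}
layerwise = apply_layerwise_scales(image_feats, layer_scales)

fused_uniform = combine(text_feats, uniform)
fused_layerwise = combine(text_feats, layerwise)
```

With a single $\lambda$, the image condition can only be strengthened or weakened everywhere at once; with per-layer scales, a layer's image contribution can be suppressed (scale 0.0 above) while another layer's is kept at full strength.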
