Stylus: Automatic Adapter Selection for Diffusion Models

Michael Luo, Justin Wong, Brandon Trabucco, Yanping Huang, Joseph E. Gonzalez, Zhifeng Chen, Ruslan Salakhutdinov, Ion Stoica

TL;DR

Stylus tackles the challenge of selecting and composing adapters from the large pool of open-source adapters for diffusion models. It uses a three-stage pipeline (Refiner, Retriever, Composer) that converts adapters into improved descriptions and embeddings, retrieves adapters relevant to the full prompt, and assigns the retrieved adapters to prompt keywords. The authors introduce StylusDocs (75K LoRAs) for scalable evaluation and demonstrate improved visual fidelity, textual alignment, and image diversity, achieving a more Pareto-efficient CLIP-FID profile and higher preference from human and VLM evaluators than base Stable Diffusion. Key contributions include a language-model-driven composer that grounds adapters to prompt tasks, a masking scheme to diversify outputs, and empirical validation across text-to-image and image-to-image tasks. The work enables practical, scalable adapter augmentation for image generation, with potential for broader domain adaptation and tooling in open-source diffusion ecosystems.

Abstract

Beyond scaling base models with more data or parameters, fine-tuned adapters provide an alternative way to generate high-fidelity, custom images at reduced costs. As such, adapters have been widely adopted by open-source communities, accumulating a database of over 100K adapters, most of which are highly customized with insufficient descriptions. This paper explores the problem of matching the prompt to a set of relevant adapters, building on recent work that highlights the performance gains of composing adapters. We introduce Stylus, which efficiently selects and automatically composes task-specific adapters based on a prompt's keywords. Stylus outlines a three-stage approach that first summarizes adapters with improved descriptions and embeddings, then retrieves relevant adapters, and finally assembles adapters based on the prompt's keywords by checking how well they fit the prompt. To evaluate Stylus, we developed StylusDocs, a curated dataset featuring 75K adapters with pre-computed adapter embeddings. In our evaluation on popular Stable Diffusion checkpoints, Stylus achieves greater CLIP-FID Pareto efficiency and is twice as preferred, with humans and multimodal models as evaluators, over the base model. See stylus-diffusion.github.io for more.
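
The three-stage flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the paper's refiner uses a VLM and a text encoder and its composer uses a language model, whereas the sketch below swaps in a bag-of-words vector for the embedding and simple keyword matching for the composer. The adapter names, descriptions, and the function names `refine`, `retrieve`, and `compose` are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical adapter descriptions; in the paper these come from a VLM.
ADAPTERS = {
    "graffiti_style": "graffiti spray paint street art on a wall",
    "corgi_detail": "corgi dog with short legs and fluffy fur",
    "anime_face": "anime character face with large eyes",
}

def embed(text):
    """Toy stand-in for a text encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def refine(adapters):
    """Stage 1 (assumed form): turn each description into an embedding."""
    return {name: embed(desc) for name, desc in adapters.items()}

def retrieve(prompt, index, k=2):
    """Stage 2 (assumed form): top-k adapters by similarity to the whole prompt."""
    query = embed(prompt)
    ranked = sorted(index, key=lambda n: cosine(query, index[n]), reverse=True)
    return ranked[:k]

def compose(prompt, candidates, adapters):
    """Stage 3 (assumed form): map prompt keywords to the candidates whose
    description mentions them; candidates matching no keyword are pruned."""
    keywords = [w for w in prompt.lower().split() if len(w) > 3]
    assignment = {}
    for kw in keywords:
        matched = [n for n in candidates if kw in adapters[n].split()]
        if matched:
            assignment[kw] = matched
    return assignment

prompt = "A graffiti of a corgi on the wall"
index = refine(ADAPTERS)
candidates = retrieve(prompt, index, k=2)
print(candidates)
print(compose(prompt, candidates, ADAPTERS))
```

The split between retrieval over the whole prompt and assignment per keyword mirrors the abstract's description; everything else, including the length-based keyword filter, is a simplification chosen to keep the example dependency-free.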

Paper Structure (33 sections, 1 equation, 14 figures, 5 tables)

Figures (14)

  • Figure 1: Stylus algorithm. Stylus consists of three stages. The refiner plugs an adapter's model card through a VLM to generate textual descriptions of an adapter's task and then through an encoder to produce the corresponding text embedding. The retriever fetches candidate adapters that are relevant to the entire user prompt. Finally, the composer prunes and jointly categorizes the remaining adapters based on the prompt's tasks, which correspond to a set of keywords.
  • Figure 2: Number of Adapters. Civit AI boasts 100K+ adapters for Stable Diffusion, outpacing Hugging Face. Low-Rank Adaptation (LoRA) is the dominant approach for fine-tuning.
  • Figure 3: Qualitative comparison of Stylus against realistic (left) and cartoon (right) style Stable Diffusion checkpoints. Stylus produces highly detailed images that correctly depict keywords in the context of the prompt. For the prompt "A graffiti of a corgi on the wall", our method correctly depicts a spray-painted corgi, whereas the checkpoint generates a realistic dog.
  • Figure 4: Human Evaluation. Stylus achieves higher preference scores (2:1) across different datasets and Stable Diffusion checkpoints.
  • Figure 5: CLIP/FID Pareto Curve for COCO. We observe Stylus can improve visual fidelity (FID) and/or textual alignment (CLIP) over a range of guidance values (CFG): [1, 1.5, 2, 3, 4, 6, 9, 12].
  • ...and 9 more figures
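
The CLIP/FID Pareto curve in Figure 5 rests on the notion of Pareto dominance: one operating point dominates another if it has a CLIP score at least as high and an FID at least as low, and is strictly better on one of the two. A minimal sketch of how such a front could be extracted from (CLIP, FID) pairs follows. The function name `pareto_front` is hypothetical, and the numbers are invented placeholders for illustration, not results from the paper.

```python
def pareto_front(points):
    """Return the (clip, fid) points not dominated by any other point.

    Higher CLIP is better; lower FID is better.
    """
    front = []
    for i, (clip_i, fid_i) in enumerate(points):
        dominated = any(
            clip_j >= clip_i and fid_j <= fid_i
            and (clip_j > clip_i or fid_j < fid_i)
            for j, (clip_j, fid_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((clip_i, fid_i))
    return sorted(front)

# Placeholder (CLIP, FID) pairs, one per guidance value; NOT from the paper.
samples = [(0.24, 30.0), (0.26, 25.0), (0.27, 27.0), (0.25, 28.0), (0.28, 33.0)]
print(pareto_front(samples))
```

Under this definition, a method is more Pareto-efficient when its front lies toward higher CLIP and lower FID than the baseline's front across the swept guidance values.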