Navigating Text-To-Image Customization: From LyCORIS Fine-Tuning to Model Evaluation

Shih-Ying Yeh; Yu-Guan Hsieh; Zhidong Gao; Bernard B W Yang; Giyeong Oh; Yanmin Gong

TL;DR

This paper introduces LyCORIS (Lora beYond Conventional methods, Other Rank adaptation Implementations for Stable diffusion) [https://github.com/KohakuBlueleaf/LyCORIS], an open-source library that offers a wide selection of fine-tuning methodologies for Stable Diffusion, and presents a thorough framework for the systematic assessment of varied fine-tuning techniques.

Abstract

Text-to-image generative models have garnered immense attention for their ability to produce high-fidelity images from text prompts. Among these, Stable Diffusion distinguishes itself as a leading open-source model in this fast-growing field. However, the intricacies of fine-tuning these models pose multiple challenges from new methodology integration to systematic evaluation. Addressing these issues, this paper introduces LyCORIS (Lora beYond Conventional methods, Other Rank adaptation Implementations for Stable diffusion) [https://github.com/KohakuBlueleaf/LyCORIS], an open-source library that offers a wide selection of fine-tuning methodologies for Stable Diffusion. Furthermore, we present a thorough framework for the systematic assessment of varied fine-tuning techniques. This framework employs a diverse suite of metrics and delves into multiple facets of fine-tuning, including hyperparameter adjustments and the evaluation with different prompt types across various concept categories. Through this comprehensive approach, our work provides essential insights into the nuanced effects of fine-tuning parameters, bridging the gap between state-of-the-art research and practical application.

Paper Structure (98 sections, 1 theorem, 30 equations, 41 figures, 7 tables, 1 algorithm)

Introduction
Preliminary
Stable Diffusion
Model Customization With
Low-Rank Adaptation (LoRA)
The LyCORIS Library
Design and Objectives
Implemented Algorithms
LoRA (LoCon)
Others
Evaluating Fine-Tuned Text-To-Image Models
Classification of Prompts for Image Generation
Evaluation Criteria
...and 83 more sections

Key Result

theorem 1

Assume that we train a neural network with forward pass modified as in eq:forward-pass-modif and that every $\vl[\dec]$ is homogeneous, for all $\csta\in\R$ and all possible input $\vlg[\decmat][\layer][1], \ldots, \vlg[\decmat][\layer][\vl[\ndecs]]$, we have Then, replacing $\vl[\scale]$ by $1$ in eq:forward-pass-modif, scaling the initialization parameters and learning rate of each layer $\lay

Figures (41)

Figure 1: This figure shows the structure of the proposed LoHa and LoKr modules implemented in LyCORIS.
Figure 2: SHAP beeswarm charts and scatter plots for analyzing the impact of change in different algorithm components. In the beeswarm plots, is in blue, is in purple, is in purple red, and is in red. Model capacity is adjusted by either increasing dimension (for or ) or decreasing factor (for ). In the scatter plots, SCD indicates that we use squared centroid distance to measure image similarity. This removes the implicit penalization towards more diverse image sets in the computation of average cosine similarity (see \ref{apx:metrics} for details). We believe it is thus more suitable when we are interested in the trade-off between fidelity and diversity. The error bars in the scatter plots represent standard errors of the metric values across random seeds and classes.
Figure 3: Qualitative comparison of checkpoints trained with different configurations. Samples of the top row are generated using only concept descriptors while samples of the bottom row are generated with the two prompts "[$V_{\text{castle}}$] scene stands against a backdrop of snow-capped mountains" and "[$V_{\text{castle}}$] scene surrounded by a lush, vibrant forest". The number of training epochs is chosen according to the concept category.
Figure 4: This figure shows how to represent as two or three linear layers (depending on whether we perform the additional low-rank decomposition or not).
Figure 5: Pearson correlation for metrics computed using different encoders and resizing methods, evaluated on images generated from different types of prompts (best viewed when zoomed in).
...and 36 more figures
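
The caption of Figure 2 notes that average cosine similarity implicitly penalizes more diverse image sets, while squared centroid distance (SCD) does not. A minimal NumPy sketch of that effect follows. It is an illustration under stated assumptions, not the paper's evaluation code: the features are synthetic random vectors rather than image-encoder outputs, both metrics are computed on L2-normalized features, and the function names (`avg_cosine_similarity`, `squared_centroid_distance`, `sample_set`) are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    # Scale every row (one feature vector per image) to unit length.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def avg_cosine_similarity(a, b):
    # Mean cosine similarity over all cross-set pairs.
    a, b = l2_normalize(a), l2_normalize(b)
    return float((a @ b.T).mean())

def squared_centroid_distance(a, b):
    # Squared Euclidean distance between the centroids of the
    # normalized features (assumed definition of SCD for this sketch).
    a, b = l2_normalize(a), l2_normalize(b)
    diff = a.mean(axis=0) - b.mean(axis=0)
    return float(diff @ diff)

rng = np.random.default_rng(0)
dim = 16
direction = np.zeros(dim)
direction[0] = 1.0

def sample_set(n, spread):
    # Synthetic "image features": a shared direction plus Gaussian noise.
    # A larger spread stands in for a more diverse image set.
    return direction + spread * rng.standard_normal((n, dim))

reference = sample_set(200, spread=0.5)  # stand-in for the training images
tight = sample_set(200, spread=0.1)      # low-diversity generated set
matched = sample_set(200, spread=0.5)    # same diversity as the reference

cos_tight = avg_cosine_similarity(tight, reference)
cos_matched = avg_cosine_similarity(matched, reference)
scd_tight = squared_centroid_distance(tight, reference)
scd_matched = squared_centroid_distance(matched, reference)

# For unit vectors, the mean pairwise cosine similarity equals the dot
# product of the two centroids, so it grows with the centroid norms.
centroid_dot = float(
    l2_normalize(matched).mean(axis=0) @ l2_normalize(reference).mean(axis=0)
)

print(f"avg cosine similarity: tight={cos_tight:.3f}, matched={cos_matched:.3f}")
print(f"squared centroid dist: tight={scd_tight:.3f}, matched={scd_matched:.3f}")
```

In this toy setup the low-diversity set scores higher on average cosine similarity even though the matched set follows the reference distribution, because a tighter set has a longer centroid. SCD instead favors the matched set, which is why it can be the more suitable measure when studying the trade-off between fidelity and diversity.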

Theorems & Definitions (3)

theorem 1
proof
remark 1
