Fine-Tuning Stable Diffusion XL for Stylistic Icon Generation: A Comparison of Caption Size

Youssef Sultan, Jiangqin Ma, Yu-Ying Liao

TL;DR

The paper tackles how to fine-tune Stable Diffusion XL for stylistic commercial icon generation, examining how caption size and training data shape output quality. By training multiple SDXL variants with short and long prompts, as well as class images, and comparing against DALL-E 3, the study reveals that short prompts with class images often optimize objective metrics (FID/CLIP) but can misalign with human judgments, while long prompts may better capture targeted style in some cases. The findings highlight that FID and CLIP alone do not reliably reflect icon quality, especially for tiny, style-specific graphics, and underscore the need for bespoke evaluation criteria and data choices. The work emphasizes the practical impact of data composition and prompting on commercial icon generation, suggesting directions for larger datasets and more nuanced metrics to better support real-world design workflows.

Abstract

In this paper, we present different fine-tuning methods for Stable Diffusion XL, covering the number of inference steps and caption customization for each image, with the goal of generating images in the style of a commercial 2D icon training set. We also show how important it is to properly define what "high-quality" really means, especially in a commercial-use environment. As generative AI models continue to gain widespread acceptance and usage, many different ways to optimize and evaluate them for various applications are emerging. Text-to-image models in particular, such as Stable Diffusion XL and DALL-E 3, require distinct evaluation practices to effectively generate high-quality icons in a specific style. Although some images generated in a certain style may have a lower (better) FID score, we show that this is not conclusive on its own, even for rasterized icons. While FID scores reflect the similarity of generated images to the overall training set, CLIP scores measure the alignment between generated images and their textual descriptions. We show that FID scores miss significant aspects, such as the minority of pixel differences that matter most in an icon, while CLIP scores misjudge the quality of icons. The CLIP model's understanding of "similarity" is shaped by its own training data, which does not account for feature variation in our style of choice. Our findings highlight the need for specialized evaluation metrics and fine-tuning approaches when generating high-quality commercial icons, potentially leading to more effective and tailored applications of text-to-image models in professional design contexts.
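
To make the two metrics named in the abstract concrete, the following is a minimal sketch, not the authors' evaluation code. It computes the Fréchet distance that underlies FID and a CLIP-style cosine score on toy feature arrays. Everything here is an assumption for illustration: the function names `frechet_distance` and `clip_style_score` are hypothetical, the random arrays stand in for Inception-v3 features and CLIP embeddings, and the factor of 100 with clamping at zero is one common scaling convention rather than something stated in the paper.

```python
import numpy as np


def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two feature sets.

    Each input is an (n_samples, n_features) array. A real FID pipeline would
    use Inception-v3 features; here the features are arbitrary (assumption).
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative when both
    # covariances are positive semi-definite.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = float(np.sum((mu_r - mu_g) ** 2))
    cov_term = float(np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)
    return mean_term + cov_term


def clip_style_score(image_emb, text_emb):
    """Cosine similarity of one image and one text embedding, scaled to 0..100."""
    cos = float(np.dot(image_emb, text_emb)
                / (np.linalg.norm(image_emb) * np.linalg.norm(text_emb)))
    return max(100.0 * cos, 0.0)


rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 8))     # stand-in for training-icon features
close = rng.normal(0.0, 1.0, size=(500, 8))    # drawn from the same distribution
shifted = rng.normal(0.5, 1.0, size=(500, 8))  # drawn from a mean-shifted distribution

fid_same = frechet_distance(real, real)
fid_close = frechet_distance(real, close)
fid_shifted = frechet_distance(real, shifted)
```

Because the distance depends only on the mean and covariance of the whole feature set, it summarizes distribution-level similarity. This is consistent with the abstract's point that a small number of decisive pixel differences in an individual icon may not register in the score.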

Paper Structure

This paper contains 14 sections, 21 figures, and 1 table.

Figures (21)

  • Figure 1: Distribution of file types for icon images related to kitchen cabinets, kitchens, and screws in the provided commercial data.
  • Figure 2: A visual representation of how the model learns to generate images according to a style.
  • Figure 3: FID and CLIP scores of screw icons generated using commercial icons for training (non-public data).
  • Figure 4: Example generated icons based on commercial training + long prompts (each caption above is what was used at inference time).
  • Figure 5: Example generated icons based on commercial training + short prompts (each caption above is what was used at inference time).
  • ...and 16 more figures