
Fine Tuning Text-to-Image Diffusion Models for Correcting Anomalous Images

Hyunwoo Yoo

TL;DR

Text-to-image diffusion models sometimes produce aberrant outputs for certain prompts, limiting practical reliability. The authors propose DreamBooth-based fine-tuning of Stable Diffusion 3 using LoRA adapters, trained on DALL-E-derived exemplars to specialize on a target prompt. The approach yields lower FID and higher SSIM and PSNR, with human surveys favoring the fine-tuned results, though large language models (LLMs) can diverge in their judgments. This work enhances the practicality and reliability of text-to-image generation for real-world applications, while highlighting residual artifacts and evaluation gaps between human and ML-based assessments.

Abstract

Since the advent of GANs and VAEs, image generation models have continuously evolved, and the introduction of Stable Diffusion and DALL-E models has opened up various real-world applications. These text-to-image models can generate high-quality images for fields such as art, design, and advertising. However, they often produce aberrant images for certain prompts. This study proposes a method to mitigate such issues by fine-tuning the Stable Diffusion 3 model using the DreamBooth technique. Experimental results targeting the prompt "lying on the grass/street" demonstrate that the fine-tuned model shows improved performance in visual evaluation and on metrics such as Structural Similarity Index (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Fréchet Inception Distance (FID). User surveys also indicated a higher preference for the fine-tuned model. This research is expected to contribute to the practicality and reliability of text-to-image models.
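
As a rough illustration of one of the metrics named above, the following minimal sketch computes PSNR from its standard definition, PSNR = 10 · log10(MAX² / MSE), on two tiny grayscale images. It is not the paper's evaluation code: the function names `mse` and `psnr`, the 8-bit peak value of 255, and the toy pixel values are all assumptions made for illustration. A higher PSNR means the generated image is closer to the reference.

```python
import math


def mse(reference, generated):
    """Mean squared error between two grayscale images given as lists of rows."""
    if len(reference) != len(generated) or any(
        len(r) != len(g) for r, g in zip(reference, generated)
    ):
        raise ValueError("images must have the same shape")
    count = sum(len(row) for row in reference)
    if count == 0:
        raise ValueError("images must not be empty")
    total = sum(
        (p - q) ** 2
        for r, g in zip(reference, generated)
        for p, q in zip(r, g)
    )
    return total / count


def psnr(reference, generated, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB; max_value=255 assumes 8-bit pixels."""
    error = mse(reference, generated)
    if error == 0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(max_value ** 2 / error)


# Toy 2x2 images (hypothetical values, not data from the paper).
reference = [[0, 64], [128, 255]]
generated = [[0, 60], [130, 250]]
score = psnr(reference, generated)
print(f"PSNR: {score:.2f} dB")
```

SSIM and FID need more machinery (local window statistics and Inception-network features, respectively), so in practice they are usually taken from an image-quality library rather than written by hand.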


Paper Structure

This paper contains 13 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Not Effective Train Data Example
  • Figure 2: Effective Train Data Example
  • Figure 3: Output Example From Stable Diffusion 3 Model
  • Figure 4: Output Example From Fine-tuned Model
  • Figure 5: Chat GPT 4o Prompt Example
  • ...and 2 more figures