Class-Conditional self-reward mechanism for improved Text-to-Image models

Safouane El Ghazouali, Arnaud Gucciardi, Umberto Michelucci

TL;DR

The paper proposes Class-Conditional Self-Rewarding (CCSR), a fully automated self-improvement loop for Text-to-Image diffusion models that leverages image captioning and open-vocabulary detection to generate high-quality training pairs without human feedback. It uses an LLM to produce class-conditioned prompts, generates multiple candidate images per prompt, and employs an Image-to-Text model plus object detection to score and select optimal prompt-image pairs for LoRA-based fine-tuning of Stable Diffusion 2.1. Empirical results show improved image realism and alignment with prompts, with CLIP-based metrics and a reported win-rate against baselines. The approach offers a scalable, automated pathway to domain-specific T2I quality enhancements and sets the stage for broader class coverage and more robust evaluation pipelines.
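The three stages named in the overview (self-judging, image filtering, optimal pair extraction) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Candidate` record, the token-overlap `judge_score`, and the `select_best_pair` helper are hypothetical names, and the Jaccard overlap stands in for whatever scoring function the paper actually uses. Captions and detected classes are supplied as plain data where the real pipeline would call an image-to-text model and an open-vocabulary detector.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """One generated image with the outputs of the judging models (stubbed as data)."""
    image_id: str
    caption: str                                         # would come from an image-to-text model
    detected_classes: set = field(default_factory=set)   # would come from an object detector


def judge_score(prompt: str, caption: str) -> float:
    """Hypothetical self-judging score: Jaccard overlap of prompt and caption tokens."""
    prompt_tokens = set(prompt.lower().split())
    caption_tokens = set(caption.lower().split())
    if not prompt_tokens or not caption_tokens:
        return 0.0
    return len(prompt_tokens & caption_tokens) / len(prompt_tokens | caption_tokens)


def select_best_pair(prompt: str, target_class: str, candidates: list):
    """Filter candidates by detected class, then keep the highest-scoring one."""
    # Image filtering: drop images in which the conditioning class was not detected.
    kept = [c for c in candidates if target_class in c.detected_classes]
    if not kept:
        return None
    # Optimal pair extraction: the best remaining image is paired with the prompt.
    best = max(kept, key=lambda c: judge_score(prompt, c.caption))
    return prompt, best


candidates = [
    Candidate("img_0", "a red car on a street", {"car"}),
    Candidate("img_1", "a red car parked on a street", {"truck"}),
    Candidate("img_2", "a dog", {"car"}),
]
pair = select_best_pair("a red car parked on a street", "car", candidates)
```

In this toy run `img_1` has the closest caption but is discarded because the target class was not detected, so the selected pair uses `img_0`. The selected prompt-image pairs would then form the fine-tuning dataset.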

Abstract

Self-rewarding models have recently emerged as a powerful tool in the field of Natural Language Processing (NLP), allowing language models to generate high-quality, relevant responses by providing their own rewards during training. This technique addresses the limitations of other methods that rely on human preferences. In this paper, we build upon the concept of self-rewarding models and introduce its vision equivalent for Text-to-Image generative AI models. The approach works by fine-tuning a diffusion model on a self-generated, self-judged dataset, which makes the fine-tuning more automated and improves data quality. The proposed mechanism makes use of other pre-trained models, such as vocabulary-based object detection and image captioning, and is conditioned on a set of objects for which the user might need to improve the quality of generated data. The approach has been implemented, fine-tuned and evaluated on Stable Diffusion, and its performance has been evaluated to be at least 60% better than existing commercial and research Text-to-Image models. Additionally, the self-rewarding mechanism allowed fully automated generation of images, while increasing the visual quality of the generated images and following prompt instructions more closely. The code used in this work is freely available at https://github.com/safouaneelg/SRT2I.

Paper Structure (18 sections, 1 equation, 10 figures, 1 table)

Figures (10)

  • Figure 1: Overview flowchart of self-rewarding mechanism for Text-2-image models. Self-rewarding mechanism groups 3 steps: (1) Self-judging function, (2) Image filtering, (3) optimal pairs extraction.
  • Figure 2: LLM prompting instructions and examples of the generated T2I prompts
  • Figure 3: For each prompt-score pair, the difference between the CLIP scores of the LoRA-trained models and the base model is computed. A positive difference is considered a win, and a negative difference is a loss. A tie occurs when the difference between the scores lies within $\pm$0.01. Each of the 50 validation prompts is tested 4(9) times with a different fixed seed.
  • Figure 4: Text-to-image results compared across the base model, our retrained Stable Diffusion version at 0.4 and 0.7 LoRA weight intensity, and LimeWire Studio [Bluewillow]. The CLIP score below each generated image measures the compatibility of the image-prompt pair.
  • Figure 5: Validation prompt example showcasing the instruction-following capabilities of the proposed self-rewarding mechanism. Image 1 is from the original Stable Diffusion; the other image is generated after LoRA fine-tuning on the self-generated dataset.
  • ...and 5 more figures
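
The win/tie/loss rule described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `outcome` and `outcome_rates` are hypothetical, the scores are made-up numbers rather than results from the paper, and the difference is assumed to be taken as fine-tuned score minus base score so that a positive value favours the fine-tuned model.

```python
from collections import Counter


def outcome(finetuned_score: float, base_score: float, tol: float = 0.01) -> str:
    """Classify one comparison; differences smaller than tol in magnitude are ties."""
    diff = finetuned_score - base_score  # assumed direction: fine-tuned minus base
    if abs(diff) < tol:
        return "tie"
    return "win" if diff > 0 else "loss"


def outcome_rates(score_pairs: list, tol: float = 0.01) -> dict:
    """Return the fraction of wins, ties and losses over (fine-tuned, base) score pairs."""
    if not score_pairs:
        raise ValueError("score_pairs must not be empty")
    counts = Counter(outcome(f, b, tol) for f, b in score_pairs)
    return {label: counts[label] / len(score_pairs) for label in ("win", "tie", "loss")}


# Illustrative (fine-tuned, base) CLIP scores, not taken from the paper.
example_pairs = [(0.35, 0.30), (0.300, 0.305), (0.25, 0.31), (0.40, 0.31)]
rates = outcome_rates(example_pairs)
```

With these four illustrative pairs there are two wins, one tie and one loss. In the setting described by the caption, each validation prompt would contribute one pair per fixed seed.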