Reproducibility Study of "ITI-GEN: Inclusive Text-to-Image Generation"

Daniel Gallo Fernández; Răzvan-Andrei Matisan; Alejandro Monroy Muñoz; Janusz Partyka

TL;DR

ITI-Gen addresses fairness in text-to-image generation by learning inclusive tokens for each attribute category from reference images. Training is guided by two losses, $L_{dir}$ and $L_{sem}$, and the resulting inclusive prompts balance diversity and quality. The reproducibility study largely confirms the original claims (improved inclusiveness, cross-domain adaptability, and data and compute efficiency), but finds that scaling to many attributes is problematic: training time grows exponentially with the number of attributes, and some attributes remain entangled. To address these limitations, the authors explore Hard Prompt Search with negative prompting (HPSn) and show that combining ITI-Gen with HPSn yields strong inclusiveness while maintaining quality. The work also demonstrates plug-and-play compatibility with ControlNet and shows that ITI-Gen can rely on proxy features that entangle attributes, underlining the need for careful reference-data selection and potential privacy considerations. Overall, the study provides practical guidance for deploying inclusive T2I systems and highlights a complementary path to handle negations and continuous attributes.

Abstract

Text-to-image generative models often present issues regarding fairness with respect to certain sensitive attributes, such as gender or skin tone. This study aims to reproduce the results presented in "ITI-GEN: Inclusive Text-to-Image Generation" by Zhang et al. (2023a), which introduces a model to improve inclusiveness in these kinds of models. We show that most of the claims made by the authors about ITI-GEN hold: it improves the diversity and quality of generated images, it is scalable to different domains, it has plug-and-play capabilities, and it is efficient from a computational point of view. However, ITI-GEN sometimes uses undesired attributes as proxy features and it is unable to disentangle some pairs of (correlated) attributes such as gender and baldness. In addition, when the number of considered attributes increases, the training time grows exponentially and ITI-GEN struggles to generate inclusive images for all elements in the joint distribution. To solve these issues, we propose using Hard Prompt Search with negative prompting, a method that does not require training and that handles negation better than vanilla Hard Prompt Search. Nonetheless, Hard Prompt Search (with or without negative prompting) cannot be used for continuous attributes that are hard to express in natural language, an area where ITI-GEN excels as it is guided by images during training. Finally, we propose combining ITI-GEN and Hard Prompt Search with negative prompting.
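
The idea behind Hard Prompt Search with negative prompting can be illustrated with a minimal sketch: present categories are appended to the text prompt, while absent categories are moved into a negative prompt instead of being written as a negation. The `Category` class, the `ATTRIBUTES` table, the phrase wording, and `build_prompts` are hypothetical names chosen for illustration and are not taken from the study's code. The sketch also shows why the joint distribution grows quickly: the number of combinations is the product of the category counts.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass(frozen=True)
class Category:
    """One category of an attribute (hypothetical helper, not from the paper)."""
    name: str
    positive: Optional[str] = None  # phrase appended to the text prompt
    negative: Optional[str] = None  # phrase placed in the negative prompt


# Illustrative attribute table; the exact phrases are assumptions.
ATTRIBUTES = {
    "Gender": [Category("man", positive="man"),
               Category("woman", positive="woman")],
    "Bald": [Category("bald", positive="bald"),
             Category("not bald", negative="bald")],
    "Eyeglasses": [Category("eyeglasses", positive="wearing eyeglasses"),
                   Category("no eyeglasses", negative="eyeglasses")],
}


def build_prompts(base, attributes):
    """Return one (prompt, negative_prompt) pair per category combination."""
    pairs = []
    for combo in product(*attributes.values()):
        positives = [c.positive for c in combo if c.positive]
        negatives = [c.negative for c in combo if c.negative]
        prompt = ", ".join([base, *positives])
        negative_prompt = ", ".join(negatives)  # empty string if nothing is negated
        pairs.append((prompt, negative_prompt))
    return pairs


pairs = build_prompts("a headshot of a person", ATTRIBUTES)
print(len(pairs))  # 2 * 2 * 2 combinations
for prompt, negative_prompt in pairs:
    print(repr(prompt), "| negative:", repr(negative_prompt))
```

Each pair could then be handed to a text-to-image pipeline that supports negative prompts (for example, the `prompt` and `negative_prompt` arguments of a Stable Diffusion pipeline in the `diffusers` library). Because no tokens are learned, adding an attribute only adds prompt combinations to sample from, not training time.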

Paper Structure (28 sections, 4 equations, 11 figures, 5 tables)

Introduction
Scope of reproducibility
Methodology
Model description
Datasets
Hyperparameters
Experimental setup and code
Computational requirements
Results
Results reproducing original paper
Inclusive and high-quality generation
Scalability to different domains
Plug-and-play capabilities
Data and computational efficiency
Scalability to multiple attributes
...and 13 more sections

Figures (11)

Figure 1: Scalability to different domains. Images generated with ITI-Gen in two different domains: human faces and natural scenes. Each column corresponds to a different category of the attribute.
Figure 2: Plug-and-play capabilities. ITI-Gen is used to learn inclusive tokens for "Age" and "Gender" using the text prompt "a headshot of a person". These tokens are then applied to other similar text prompts.
Figure 3: Compatibility with ControlNet. We generate images using the prompt "photo of a famous woman" and human pose (left) as additional condition. The attribute of interest is "Age", which is trained using the text prompt "a headshot of a person".
Figure 4: Proxy features are used by the model for certain attributes. (a) Generated samples using the fair tokens for "Bald". All positive samples are men, whereas all negative samples are women, which indicates that "Gender" might be used as a proxy feature. (b) When combining the "Gender" and "Bald" attributes, ITI-Gen fails to generate samples of bald women. (c) HPS with negative prompting is able to accurately generate bald women. In (b) and (c), the KL divergence is computed using 104 manually labeled samples.
Figure 5: Ablation study on the diversity of reference datasets. Generated images for all category combinations of the "Gender" and "Eyeglasses" attributes, for each variation of the "Eyeglasses" reference dataset introduced in Table \ref{tab:eyeglasses-non-diverse-datasets}. The complete reference dataset for the "Gender" attribute is used in all three cases, and the KL divergence ($D_{\mathrm{KL}}$) is computed using 104 manually labeled samples.
...and 6 more figures
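
The KL divergence mentioned in the captions of Figures 4 and 5 can be sketched as follows, assuming it measures how far the empirical distribution of labeled samples is from a uniform distribution over all category combinations, in the direction $D_{\mathrm{KL}}(\hat{p} \,\|\, u)$ and with natural logarithms. The direction, the log base, the function name `kl_to_uniform`, and the label counts below are assumptions made for illustration; the counts are invented and are not results from the study.

```python
import math
from collections import Counter


def kl_to_uniform(labels, categories):
    """KL divergence D_KL(p_hat || uniform) of the empirical label distribution.

    `labels` holds one label per sample; `categories` lists every category
    combination that an inclusive model should cover. Categories that never
    appear contribute 0 to the sum (the convention 0 * log 0 = 0).
    """
    counts = Counter(labels)
    n = len(labels)
    k = len(categories)
    kl = 0.0
    for category in categories:
        p = counts[category] / n
        if p > 0:
            kl += p * math.log(p * k)  # p * log(p / (1 / k))
    return kl


CATEGORIES = ["man, bald", "man, not bald", "woman, bald", "woman, not bald"]

# Invented example: 104 samples spread evenly over the four combinations.
balanced = [c for c in CATEGORIES for _ in range(26)]

# Invented example: 104 samples in which one combination never appears.
skewed = (["man, bald"] * 52 + ["man, not bald"] * 26
          + ["woman, not bald"] * 26)

print(kl_to_uniform(balanced, CATEGORIES))
print(kl_to_uniform(skewed, CATEGORIES))
```

A value of zero corresponds to perfectly balanced generation, and the value grows as samples concentrate on fewer combinations, so a missing combination such as "woman, bald" shows up as a clearly positive divergence.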
