A Systematic Review of Open Datasets Used in Text-to-Image (T2I) Gen AI Model Safety

Rakeen Rouf, Trupti Bavalatti, Osama Ahmed, Dhaval Potdar, Faraz Jawed

TL;DR

This paper systematically reviews open-source datasets used to study T2I model safety, focusing on harm coverage, prompt diversity, and labeling quality. Using the AIR-2024 taxonomy, the authors re-label prompts across datasets, quantify syntactic and semantic diversity with n-gram metrics and embeddings, and analyze language distribution and thematic content via GPT-4o tagging. They find a pronounced overrepresentation of Sexual Content, substantial labeling inconsistencies, limited multilingual coverage, and notable downstream artifacts in synthetic data, such as AI-generated gibberish. The work provides a data-centric assessment that informs dataset selection, highlights gaps for future dataset design, and emphasizes the need for standardized taxonomies, multilingual prompts, and robust data quality controls to improve T2I safety research and applications.

Abstract

Novel research aimed at text-to-image (T2I) generative AI safety often relies on publicly available datasets for training and evaluation, making the quality and composition of these datasets crucial. This paper presents a comprehensive review of the key datasets used in T2I research, detailing their collection methods, their composition, the semantic and syntactic diversity of their prompts, and the quality, coverage, and distribution of harm types they contain. By highlighting the strengths and limitations of these datasets, this study enables researchers to find the most relevant datasets for a use case, to critically assess the downstream impacts of their work given the dataset distribution, particularly regarding model safety and ethical considerations, and to identify gaps in dataset coverage and quality that future research may address.

Paper Structure

This paper contains 39 sections, 5 equations, 34 figures, and 5 tables.

Figures (34)

  • Figure 1: The complete dataset mapped onto AIR categories. 'Unknown' refers to prompts that were not mapped onto any of the categories in AIR.
  • Figure 2: The figure outlines the steps that an input prompt goes through in generating the L2 and L3 labels.
  • Figure 3: This figure shows the number of prompts by class of harmful concept found in our compiled dataset. Current prompt curation in the T2I space focuses very heavily on sexual content, with less attention to categories such as violence.
  • Figure 4: This figure shows the composition of each dataset by benign and harmful concepts. Most datasets contain a significant number of benign prompts. Benign prompts are defined as prompts classified as innocuous by our AIR classifier. L2 categories are defined as the level 2 taxonomy/categories from the AIR taxonomy.
  • Figure 5: This figure shows the composition of each dataset by concept. Some of the most diverse datasets in our study were I2P, ART, P4D, and Latentguard CoPro. Diverse datasets are defined as those that have a balanced mix of harmful concepts. L2 categories are defined as the level 2 taxonomy/categories from the AIR taxonomy.
  • ...and 29 more figures
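
The Figure 5 caption defines a diverse dataset as one with a balanced mix of harmful concepts. One way to put a number on that notion is normalized Shannon entropy over a dataset's category labels. The sketch below is an illustrative assumption, not the metric used in the paper; the function name `normalized_entropy` and the example category labels are hypothetical. It normalizes by the number of categories actually observed, so a dataset that covers only a few categories evenly still scores high.

```python
import math
from collections import Counter


def normalized_entropy(labels):
    """Shannon entropy of the label distribution, scaled to [0, 1].

    Returns 0.0 for an empty list or a single observed category.
    Hypothetical helper for illustration; not taken from the paper.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    # Divide by the maximum entropy for the number of observed categories.
    return entropy / math.log(len(counts))


# Hypothetical label lists: one evenly spread, one dominated by a single category.
balanced = ["Sexual Content", "Violence", "Hate", "Self-harm"] * 25
skewed = ["Sexual Content"] * 97 + ["Violence", "Hate", "Self-harm"]

print(round(normalized_entropy(balanced), 3))
print(round(normalized_entropy(skewed), 3))
```

A score near 1 indicates an even spread across the observed categories, while a score near 0 indicates that one category dominates, as the Figure 3 caption describes for sexual content in the compiled dataset.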