Towards Understanding and Quantifying Uncertainty for Text-to-Image Generation

Gianni Franchi, Dat Nguyen Trong, Nacim Belkhir, Guoxuan Xia, Andrea Pilzer

TL;DR

The paper tackles prompt-space uncertainty in text-to-image generation by introducing PUNC, a method that leverages Large Vision-Language Models to caption generated images and assess semantic alignment with the original prompt, enabling disentanglement of aleatoric and epistemic uncertainties in the text-to-image pipeline. It situates PUNC among existing image-space uncertainty approaches (diffusion/noise-based, time-step, and test-time ensembling) and provides a thorough background on diffusion models, forward and reverse processes, and SDE/ODE formulations. The authors validate PUNC across multiple diffusion-based T2I models and diverse prompt datasets, showing competitive performance and exposing domain-dependent limitations (e.g., texture data). They also demonstrate practical applications in deepfake detection, copyright risk assessment, and bias analysis, arguing that semantic uncertainty quantification can enhance reliability, safety, and policy compliance in generative systems. A new dataset of prompts and generation pairs is released to spur further research in uncertainty quantification for multimodal generation and trustworthy AI.

Abstract

Uncertainty quantification in text-to-image (T2I) generative models is crucial for understanding model behavior and improving output reliability. In this paper, we are the first to quantify and evaluate the uncertainty of T2I models with respect to the prompt. Alongside adapting existing approaches designed to measure uncertainty in the image space, we also introduce Prompt-based UNCertainty Estimation for T2I models (PUNC), a novel method leveraging Large Vision-Language Models (LVLMs) to better address uncertainties arising from the semantics of the prompt and generated images. PUNC uses an LVLM to caption a generated image, and then compares the caption with the original prompt in the more semantically meaningful text space. PUNC also enables the disentanglement of aleatoric and epistemic uncertainties via precision and recall, which image-space approaches are unable to do. Extensive experiments demonstrate that PUNC outperforms state-of-the-art uncertainty estimation techniques across various settings. Uncertainty quantification in T2I models supports various applications, including bias detection, copyright protection, and out-of-distribution (OOD) detection. We also introduce a comprehensive dataset of text prompts and generation pairs to foster further research in uncertainty quantification for generative models. Our findings illustrate that PUNC not only achieves competitive performance but also enables novel applications in evaluating and improving the trustworthiness of text-to-image models.
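The caption-versus-prompt comparison described above can be illustrated with a minimal sketch. It assumes the caption has already been produced by some LVLM (no captioning model is called here) and uses plain token overlap as a stand-in for whatever text-similarity metric the paper actually uses. The function names `tokenize`, `precision_recall`, and `uncertainty_scores` are hypothetical, and the sketch does not assert which of the two scores corresponds to aleatoric or epistemic uncertainty, since this section does not say.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split into alphanumeric tokens (a simplifying assumption).
    return re.findall(r"[a-z0-9]+", text.lower())


def precision_recall(prompt, caption):
    # Multiset overlap between prompt tokens and caption tokens.
    prompt_counts = Counter(tokenize(prompt))
    caption_counts = Counter(tokenize(caption))
    overlap = sum((prompt_counts & caption_counts).values())
    n_caption = sum(caption_counts.values())
    n_prompt = sum(prompt_counts.values())
    # Precision: share of caption tokens that also occur in the prompt.
    precision = overlap / n_caption if n_caption else 0.0
    # Recall: share of prompt tokens that the caption recovers.
    recall = overlap / n_prompt if n_prompt else 0.0
    return precision, recall


def uncertainty_scores(prompt, caption):
    # Higher values mean weaker prompt/caption agreement.
    precision, recall = precision_recall(prompt, caption)
    return {"one_minus_precision": 1.0 - precision,
            "one_minus_recall": 1.0 - recall}


if __name__ == "__main__":
    prompt = "a photo of a fish"
    caption = "a photo of a dog on grass"  # hypothetical LVLM caption
    print(precision_recall(prompt, caption))
    print(uncertainty_scores(prompt, caption))
```

Token overlap is only the simplest possible instance of a precision/recall comparison in text space. An embedding-based similarity would be a natural substitute, at the cost of requiring a language model.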

Paper Structure

This paper contains 29 sections, 14 equations, 22 figures, 10 tables.

Figures (22)

  • Figure 0: Examples of applications for uncertainty quantification in text-to-image generation. Text-to-image generation models may exhibit uncertainty, which needs to be quantified because it can provide insights into the model’s training dataset, aiding in deepfake prevention, detecting model biases, and protecting copyrighted content from unauthorized generation.
  • Figure 1: Diagram illustrating generation/image-space uncertainty, which is considered in previous work, and condition/prompt-space uncertainty, which is investigated in our work.
  • Figure 2: Three generations from PixArt-$\Sigma$ (chen2024pixart) illustrating uncertainty with regard to prompt semantics. For the normal image, we used an ImageNet-inspired prompt; for the corrupted image, additional corruption was applied to the prompt (e.g., "fish" was perturbed to "fis"), increasing the aleatoric uncertainty; and for the out-of-distribution (OOD) case, the model was prompted to generate an image of the Prime Minister of Japan Kishida Fumio under similar conditions, where epistemic uncertainty was injected in the form of a semantic concept the model was not familiar with.
  • Figure 3: Illustration showing the different baselines and PUNC. PUNC leverages an LVLM to describe generated images and assess similarity with the original prompt, providing a refined uncertainty score. In contrast, baseline methods employ traditional techniques such as noise injection, ensembling, or masking to quantify uncertainty, followed by image-based similarity scoring.
  • Figure 4: Illustration of gender bias in diffusion models with respect to job representation.
  • ...and 17 more figures