DiffusionWorldViewer: Exposing and Broadening the Worldview Reflected by Generative Text-to-Image Models

Zoe De Simone, Angie Boggust, Arvind Satyanarayan, Ashia Wilson

TL;DR

Generative TTI systems encode a worldview from training data that may misalign with user perspectives. The authors introduce DiffusionWorldViewer, an interactive interface that exposes demographic distributions in TTI outputs and enables editing toward user-valued worldviews using semantic guidance, without retraining. They formalize a worldview framework based on CAPTA and ARROWS, and implement back-end and front-end components to surface, compare, and adjust outputs via four editing modes (parity, US demographics, absolute, relative). A user study with 18 diverse participants and two case studies show that the tool increases awareness of model biases, broadens representation, and supports task-dependent editing, while highlighting trade-offs and ethical considerations. The work lays a foundation for co-adaptive, user-aware customization of worldview in diffusion-based image synthesis and points to future work on expanding editing categories and multi-user composition of worldviews.

Abstract

Generative text-to-image (TTI) models produce high-quality images from short textual descriptions and are widely used in academic and creative domains. Like humans, TTI models have a worldview, a conception of the world learned from their training data and task that influences the images they generate for a given prompt. However, the worldviews of TTI models are often hidden from users, making it challenging for users to build intuition about TTI outputs, and they are often misaligned with users' worldviews, resulting in output images that do not match user expectations. In response, we introduce DiffusionWorldViewer, an interactive interface that exposes a TTI model's worldview across output demographics and provides editing tools for aligning output images with user perspectives. In a user study with 18 diverse TTI users, we find that DiffusionWorldViewer helps users represent their varied viewpoints in generated images and challenge the limited worldview reflected in current TTI models.

Paper Structure (50 sections, 12 equations, 7 figures, 1 table)

This paper contains 50 sections, 12 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: DiffusionWorldViewer lets users specify a new worldview to impose on the model outputs using the Parity (A), US demographics (B), Absolute (C), and Relative (D1-2) editing techniques. With the Relative editing technique, users start from the demographic distributions of the baseline SD model and specify how far to shift them toward parity.
  • Figure 2: DiffusionWorldViewer applied to the case study Minority representation in professional context compares generative text-to-image model representations for the prompt "a photo of a ceo of a silicon valley start up". (A) The baseline images depict male, middle-aged individuals with light skin. (B) The user applies the absolute editing technique to represent CEOs that look like them, selecting the female, black, 30-39, and 40-49 year old categories. The edited images all depict black individuals. Images 2, 4, and 5 depict female individuals, while images 1 and 3 are ambiguous.
  • Figure 3: DiffusionWorldViewer applied to the case study Representing Community explores TTI model representations of "a young child doing a science experiment, image in the style of a blobby illustration" (A). The user explores different worldview editing techniques, including parity (B) and absolute editing (C-D), to generate illustrations that represent their audience community.
  • Figure 4: DiffusionWorldViewer applied to the case study Inclusive representations for marketing material compares generative text-to-image outputs for the prompt "a marketing photo of a happy retirement home". (A) The baseline images reveal the underlying worldview of the model: the people it depicts are predominantly white and elderly, with grey hair. (B) Selecting the US demographics editing technique modifies the images to include individuals who appear Asian and black in images 2, 4, and 5.
  • Figure 5: DiffusionWorldViewer applied to generate more representative human-scale figures in architectural design renders. In a representative case study, an expert user generates a prompt from a reference image (A) through text2prompt and aims to include scale figures that represent the demographics of the neighborhood where the building will be built, using prompt engineering (B-C) and using DiffusionWorldViewer (D). The user tests the prompt "a large building with a bunch of windows on top of it, a surrealist sculpture by Gaudi, featured on dribble, art nouveau, made of wrought iron, biomorphic, made of vines, a sidewalk below the building, woman walking on the sidewalk".
  • ...and 2 more figures
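
The four editing techniques named in the Figure 1 caption can be read as four ways of choosing a target demographic distribution. The Python sketch below is one possible illustration of that reading, not the authors' implementation. All function names are hypothetical. The linear interpolation used for the Relative technique, the even split used for the Absolute technique, and the baseline numbers in the usage example are assumptions made here for illustration. The reference distribution for the US demographics technique is left to the caller, since no values are given in this summary.

```python
from typing import Dict, Iterable

Distribution = Dict[str, float]


def normalize(weights: Distribution) -> Distribution:
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {k: v / total for k, v in weights.items()}


def parity_target(categories: Iterable[str]) -> Distribution:
    """Parity: every category gets an equal share."""
    cats = list(categories)
    if not cats:
        raise ValueError("at least one category is required")
    return {c: 1.0 / len(cats) for c in cats}


def reference_target(reference: Distribution) -> Distribution:
    """Reference population (e.g. US demographics): caller supplies the shares."""
    return normalize(reference)


def absolute_target(categories: Iterable[str], selected: Iterable[str]) -> Distribution:
    """Absolute: only the selected categories appear.

    Assumption: the selected categories share the mass evenly.
    """
    cats = list(categories)
    chosen = set(selected)
    if not chosen or not chosen.issubset(cats):
        raise ValueError("selected must be a non-empty subset of categories")
    return {c: (1.0 / len(chosen) if c in chosen else 0.0) for c in cats}


def relative_target(baseline: Distribution, amount: float) -> Distribution:
    """Relative: move the baseline toward parity by `amount` in [0, 1].

    Assumption: a linear blend, where 0 keeps the baseline and 1 gives parity.
    """
    if not 0.0 <= amount <= 1.0:
        raise ValueError("amount must be between 0 and 1")
    base = normalize(baseline)
    parity = parity_target(base)
    return {c: (1.0 - amount) * base[c] + amount * parity[c] for c in base}


if __name__ == "__main__":
    # Hypothetical baseline shares, not measurements from the paper.
    baseline = {"male": 0.8, "female": 0.2}
    print(relative_target(baseline, 0.5))
```

Under these assumptions, an amount of 0 leaves the baseline unchanged and an amount of 1 reproduces the Parity target, which matches the caption's description of shifting the baseline distributions toward parity by a user-chosen amount.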