Vipera: Blending Visual and LLM-Driven Guidance for Systematic Auditing of Text-to-Image Generative AI

Yanwei Huang, Wesley Hanwen Deng, Sijia Xiao, Motahhare Eslami, Jason I. Hong, Arpit Narechania, Adam Perer

TL;DR

Vipera addresses the challenge of auditing large-scale text-to-image models by integrating a scene-graph–driven visual sensemaking interface with LLM-powered guidance to systematically explore outputs. The authors validate the approach with formative work and a controlled user study (n=24), showing that multimodal guidance reduces cognitive load while enabling more thorough audits across diverse criteria. They report that visual cues and AI-driven suggestions are complementary, leading to higher quality insights and more efficient workflows than ablated versions. The work provides design implications for deploying visually guided, AI-assisted auditing tools in real-world responsible-AI practices and highlights avenues for personalization and broader applicability.

Abstract

Despite their increasing capabilities, text-to-image generative AI systems are known to produce biased, offensive, and otherwise problematic outputs. While recent advancements have supported testing and auditing of generative AI, existing auditing methods still face challenges in helping auditors effectively explore the vast space of AI-generated outputs in a structured way. To address this gap, we conducted formative studies with five AI auditors and synthesized five design goals for supporting systematic AI audits. Based on these insights, we developed Vipera, an interactive auditing interface that employs multiple visual cues including a scene graph to facilitate image sensemaking and inspire auditors to explore and hierarchically organize the auditing criteria. Additionally, Vipera leverages LLM-powered suggestions to facilitate exploration of unexplored auditing directions. Through a controlled experiment with 24 participants experienced in AI auditing, we demonstrate Vipera's effectiveness in helping auditors navigate large AI output spaces and organize their analyses while engaging with diverse criteria.

Paper Structure

This paper contains 36 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: The Vipera interface. (A) The input box for creating prompts and specifying the number of images. (B) The analysis view for interactive auditing. Left: The prompts and AI-powered suggestions of auditing criteria and prompts. Right: The generated images and an interactive scene graph summarizing the image semantics to inspire users and guide the auditing process. (C) The note view for composing an auditing report.
  • Figure 2: The ViperaBase prototype used in the formative study, showing the generated images (right) and a scene graph (left) for the user prompt (top). The scene graph is a node-link diagram where nodes indicate objects (or their attributes) within the images and edges indicate the semantic relationships. Bar charts are shown when the user hovers over an attribute node.
  • Figure 3: View coordination between the image view and the scene graph. When the user hovers over an image, a pop-up appears showing its labels, and the relevant attribute nodes in the scene graph are highlighted. When the user hovers over a bar in a bar chart embedded in an attribute node, the corresponding images are highlighted in the image view.
  • Figure 4: Vipera's technical pipeline. Black edges indicate the data flow: (A) The prompt from the user is fed to the T2I model to generate images. (B) A scene graph is generated from the images (see Figure 5 for details). (C) When an attribute node is added to the scene graph, the images are labeled using the specified criteria. (D) The labels are aggregated and visualized as a stacked bar chart. (E) The charts are embedded into the attribute nodes of the scene graph for rendering. (F) The images and charts are added to the auditing report when the user bookmarks them, with the user's comments attached. Blue edges indicate the inspiration flow for iterative auditing: Prompts and auditing criteria can be inspired by inspecting the images (a), thinking with the scene graph (b), and applying the LLM-driven guidance (c).
  • Figure 5: The detailed technical pipeline for generating the initial scene graph. (A) A random subset is sampled from the images. (B) For each image, an individual scene graph is generated through an omni-modal LLM. (C) The resulting scene graphs are merged into one. (D) The merged graph is pruned to keep at most five leaf nodes.
  • ...and 3 more figures
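
The four steps in the Figure 5 caption (sample, describe, merge, prune) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the nested-dict graph representation, the function names, the `describe` callable standing in for the omni-modal LLM, and the rule of ranking leaves by how many per-image graphs contain them are all hypothetical. The caption only states that at most five leaf nodes are kept.

```python
import random

# Assumption: a scene graph is a nested dict mapping node name -> child subtree.
# An empty dict marks a leaf node.

def sample_images(image_ids, k, seed=0):
    """(A) Draw a random subset of at most k generated images."""
    rng = random.Random(seed)
    return rng.sample(list(image_ids), min(k, len(image_ids)))

def merge_graphs(graphs):
    """(C) Merge per-image graphs by taking the union of nodes with the same path."""
    merged = {}
    for graph in graphs:
        _merge_into(merged, graph)
    return merged

def _merge_into(target, source):
    for name, children in source.items():
        _merge_into(target.setdefault(name, {}), children)

def leaf_paths(graph, prefix=()):
    """Yield the path (tuple of node names) to every leaf of the graph."""
    for name, children in graph.items():
        path = prefix + (name,)
        if children:
            yield from leaf_paths(children, path)
        else:
            yield path

def contains(graph, path):
    """Return True if the graph has a node at the given path."""
    node = graph
    for name in path:
        if name not in node:
            return False
        node = node[name]
    return True

def prune(merged, per_image, max_leaves=5):
    """(D) Keep at most max_leaves leaves of the merged graph.

    Assumption: leaves are ranked by how many per-image graphs contain them.
    """
    leaves = list(leaf_paths(merged))
    support = {p: sum(contains(g, p) for g in per_image) for p in leaves}
    keep = sorted(leaves, key=lambda p: -support[p])[:max_leaves]
    pruned = {}
    for path in keep:
        node = pruned
        for name in path:
            node = node.setdefault(name, {})
    return pruned

def build_initial_graph(image_ids, describe, k=3, max_leaves=5):
    """Run steps A to D; `describe` is a hypothetical stand-in for the LLM call (B)."""
    per_image = [describe(i) for i in sample_images(image_ids, k)]
    return prune(merge_graphs(per_image), per_image, max_leaves)

# Toy usage with hand-written per-image graphs in place of LLM output.
toy = {
    "img1": {"person": {"gender": {}, "age": {}}, "desk": {}},
    "img2": {"person": {"gender": {}, "clothing": {}}, "laptop": {}},
    "img3": {"person": {"gender": {}}, "desk": {}, "window": {}, "plant": {}},
}
initial_graph = build_initial_graph(list(toy), toy.get, k=3)
```

Ranking by support favors objects and attributes that recur across images, which seems a reasonable default for a summary graph, but any other pruning criterion could be swapped into `prune` without changing the rest of the sketch.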