Seeing Twice: How Side-by-Side T2I Comparison Changes Auditing Strategies
Matheus Kunzler Maldaner, Wesley Hanwen Deng, Jason I. Hong, Kenneth Holstein, Motahhare Eslami
TL;DR
Seeing Twice introduces MIRAGE, a web-based tool for contrast-first auditing of text-to-image (T2I) generation. By displaying up to four generators side by side and encouraging users to reflect on their prompts, MIRAGE changes how users perceive outputs and identify biases. The study shows that side-by-side comparison shifts attention from individual images to distribution-level patterns and reveals language-fidelity gaps between prompts written in different languages. The work proposes future directions, including larger controlled studies, anonymous auditing, and community feedback loops, to enhance AI transparency.
Abstract
While generative AI systems have gained popularity in diverse applications, their potential to produce harmful outputs limits their trustworthiness and utility. A small but growing line of research has explored tools and processes to better engage users who are not AI experts in auditing generative AI systems. In this work, we present the design and evaluation of MIRAGE, a web-based tool exploring a "contrast-first" workflow that allows users to pick up to four different text-to-image (T2I) models, view their images side by side, and provide feedback on model performance on a single screen. In our user study with fifteen participants, we used four predefined models for consistency and initially showed only a single model. We found that most participants shifted from analyzing individual images to analyzing general model output patterns once the side-by-side step appeared with all four models; several participants coined persistent "model personalities" (e.g., cartoonish, saturated) that helped them form expectations about how each model would behave on future prompts. Bilingual participants also surfaced a language-fidelity gap: English prompts produced more accurate images than Portuguese or Chinese prompts, an issue often overlooked when working with a single model. These findings suggest that simple comparative interfaces can accelerate bias discovery and reshape how people think about generative models.
