
Evaluating AI-Generated Images of Cultural Artifacts with Community-Informed Rubrics

Nari Johnson, Deepthi Sudharsan, Hamna, Samantha Dalal, Theo Holroyd, Anja Thieme, Hoda Heidari, Daniela Massiceti, Jennifer Wortman Vaughan, Cecily Morrison

Abstract

Measurement is essential to improving AI performance and mitigating harms for marginalized groups. As generative AI systems are rapidly deployed across geographies and contexts, AI measurement practices must be designed to support repeatable, automatable application across different models, datasets, and evaluation settings. But the drive to automate measurement can be in tension with the ability of measurement instruments to capture the expertise and perspectives of communities impacted by AI. Recent work advocates for breaking measurement into several key stages: first moving from an abstract concept to be measured into a precise, "systematized" concept; next operationalizing the systematized concept into a concrete measurement instrument; and finally applying the measurement instrument on data to produce measurements. This opens up an opportunity to concentrate community engagement in the systematization phase before operationalizing and applying measurement instruments. In this paper, we explore how to involve communities in systematizing the concept of "cultural appropriateness" in text-to-image models' representation of culturally significant artifacts through case studies with three communities: blind and low vision individuals residing in the UK, residents of Kerala, and residents of Tamil Nadu. Our systematized concepts reflect community members' lived experiences interacting with each artifact and how they want their material culture to be depicted, demonstrating the value of community involvement in defining valid measures. We explore how these systematized concepts can be operationalized into automated measurement instruments that could be applied using a multimodal LLM-as-a-judge approach, as well as the challenges that remain. We reflect on the benefits and limitations of such approaches.


Paper Structure

This paper contains 52 sections, 17 figures, 15 tables.

Figures (17)

  • Figure 1: Scaffolding community engagement to develop community-centered measures of cultural representation. Given an input prompt (e.g., "a photo of a guide cane"), we invited community members to participate in designing a rubric that captures their expertise and preferences for each cultural artifact (systematization). Our research team then explored the use of this rubric within an automated multimodal LLM-as-a-judge pipeline (operationalization).
  • Figure 2: Measurement framework from the social sciences [adcock2001measurement; wallach2025position]. Our research studies how to center community knowledge in the systematization process before operationalizing the systematized concept as an automated MLLM-as-a-judge system.
  • Figure 3: Selected culturally significant artifacts. From right to left: (1) With the blind and low vision community, we selected a guide cane (a mobility aid that is held diagonally across one's body) and a braille notetaker (an electronic device that can be used to read and write notes in tactile braille). (2) With residents of Tamil Nadu, we selected Pallanguzhi (a two-player mancala game where players compete to collect cowry shells or seeds) and Mridangam (a percussion instrument widely used in South Asian classical music). (3) With residents of Kerala, we selected a Kasavu saree (a handwoven saree from Kerala, known for its off-white body with a gold border), and Chundan Vallam (a traditional boat from Kerala with a raised prow commonly used in festival races).
  • Figure 4: A rubric to score images of a guide cane, designed with BLV community members. Criteria that correspond to visual features in images are organized under two themes that describe participants' desires for cultural representation.
  • Figure 5: Human-MLLM judge alignment for individual rubric criteria. A histogram of human-MLLM agreement rates across rubric criteria. The MLLM's ability to annotate a criterion accurately varies widely: for the example criterion on the left, GPT-4o has low accuracy (agreement rate 0.46) at annotating whether a drum's head is made of the correct material, while for the criterion on the right, GPT-4o has high accuracy (agreement rate 0.98) at determining whether a cane is white in color.
  • ...and 12 more figures
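
The per-criterion agreement rates described in Figure 5 can be sketched in a few lines. The data below is hypothetical (the names `cane_is_white` and `drum_head_material` and the binary yes/no annotation format are assumptions for illustration, not the paper's actual pipeline):

```python
# Sketch: per-criterion human-MLLM agreement rate, as reported in Figure 5.
# Annotation format (binary yes/no per image) is an assumption for illustration.

def agreement_rate(human_labels, mllm_labels):
    """Fraction of images where the MLLM judge matches the human annotation."""
    assert len(human_labels) == len(mllm_labels)
    matches = sum(h == m for h, m in zip(human_labels, mllm_labels))
    return matches / len(human_labels)

# Hypothetical annotations for two rubric criteria over five images.
human = {
    "cane_is_white": [1, 1, 0, 1, 0],
    "drum_head_material": [1, 0, 1, 1, 0],
}
mllm = {
    "cane_is_white": [1, 1, 0, 1, 0],       # identical to human -> rate 1.0
    "drum_head_material": [1, 1, 0, 1, 1],  # agrees on 2 of 5 -> rate 0.4
}

rates = {c: agreement_rate(human[c], mllm[c]) for c in human}
```

A histogram of such per-criterion rates, one rate per rubric criterion, is what Figure 5 visualizes.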