Crossroads of Continents: Automated Artifact Extraction for Cultural Adaptation with Large Multimodal Models

Anjishnu Mukherjee, Ziwei Zhu, Antonios Anastasopoulos

TL;DR

A comprehensive three-phase study to examine the cultural understanding of Large Multimodal Models by introducing Dalle Street, a large-scale dataset generated by DALL-E 3 and validated by humans, revealing a nuanced picture of the cultural competence of LMMs.

Abstract

We present a comprehensive three-phase study to examine (1) the cultural understanding of Large Multimodal Models (LMMs) by introducing DalleStreet, a large-scale dataset generated by DALL-E 3 and validated by humans, containing 9,935 images of 67 countries and 10 concept classes; (2) the underlying implicit and potentially stereotypical cultural associations with a cultural artifact extraction task; and (3) an approach to adapt cultural representation in an image based on extracted associations using a modular pipeline, CultureAdapt. We find disparities in cultural understanding at geographic sub-region levels with both open-source (LLaVA) and closed-source (GPT-4V) models on DalleStreet and other existing benchmarks, which we try to understand using over 18,000 artifacts that we identify in association with different countries. Our findings reveal a nuanced picture of the cultural competence of LMMs, highlighting the need to develop culture-aware systems. Dataset and code are available at https://github.com/iamshnoo/crossroads

Paper Structure (82 sections, 1 equation, 29 figures, 5 tables)

Figures (29)

  • Figure 1: We introduce a large-scale dataset for measuring cultural awareness, an artifact extraction task for implicit cultural associations, and a modular pipeline for culturally adapting images with fine-grained edits.
  • Figure 2: LLaVA matches or outperforms GPT-4V on two of three datasets. Human accuracy on a Dalle Street subset is 47.63%.
  • Figure 3: Confusion matrices for GPT-4V on the cultural awareness task for Dalle Street images. Accurate responses match the true subregion. Special labels include Invalid (no match or incomplete) and ResponsibleAI (policy violation). Takeaway: The model performs well, with a strong leading diagonal and 100% accuracy for Western Asia (which covers Iran, Jordan, Lebanon, Oman, Palestine, Turkey).
  • Figure 4: We normalize Dollar Street income data into region-specific quartiles and plot accuracies for GPT-4V. Takeaway: Lower income quartiles (Q1, Q2) show higher accuracy in Africa and Asia, while higher quartiles (Q3, Q4) perform better in the Americas. In Europe, accuracy is similar across all quartiles.
  • Figure 5: We score each artifact based on its likelihood of co-occurrence for a country. Scores outside the mean and standard deviation range (red) indicate frequent co-occurrences, representing implicit (potentially stereotypical) associations.
  • ...and 24 more figures
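
The thresholding described in the Figure 5 caption can be illustrated with a minimal sketch. It assumes the "mean and standard deviation range" means the mean plus or minus one standard deviation, and that scores are already available as one number per artifact for a given country. The function name `flag_outlier_artifacts`, the parameter `k`, and the example scores are hypothetical and are not taken from the paper or its repository.

```python
from statistics import mean, pstdev


def flag_outlier_artifacts(scores, k=1.0):
    """Return artifacts whose score lies outside mean +/- k * std.

    `scores` maps artifact name -> co-occurrence score for one country.
    Using k=1.0 and the population standard deviation are assumptions.
    """
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values)
    lower, upper = mu - k * sigma, mu + k * sigma
    # Keep only artifacts that fall strictly outside the [lower, upper] band.
    return {name: s for name, s in scores.items() if s < lower or s > upper}


# Made-up scores for illustration only; not data from the paper.
example_scores = {
    "artifact_a": 0.10,
    "artifact_b": 0.12,
    "artifact_c": 0.11,
    "artifact_d": 0.90,
    "artifact_e": 0.09,
}
flagged = flag_outlier_artifacts(example_scores)
print(sorted(flagged))
```

With these made-up scores, only `artifact_d` falls outside the band, so it would be the one marked as an implicit association under this reading of the caption.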