Crossroads of Continents: Automated Artifact Extraction for Cultural Adaptation with Large Multimodal Models

Anjishnu Mukherjee; Ziwei Zhu; Antonios Anastasopoulos

TL;DR

A comprehensive three-phase study to examine the cultural understanding of Large Multimodal Models by introducing Dalle Street, a large-scale dataset generated by DALL-E 3 and validated by humans, revealing a nuanced picture of the cultural competence of LMMs.

Abstract

We present a comprehensive three-phase study to examine (1) the cultural understanding of Large Multimodal Models (LMMs) by introducing DalleStreet, a large-scale dataset generated by DALL-E 3 and validated by humans, containing 9,935 images of 67 countries and 10 concept classes; (2) the underlying implicit and potentially stereotypical cultural associations with a cultural artifact extraction task; and (3) an approach to adapt cultural representation in an image based on extracted associations using a modular pipeline, CultureAdapt. We find disparities in cultural understanding at geographic sub-region levels with both open-source (LLaVA) and closed-source (GPT-4V) models on DalleStreet and other existing benchmarks, which we investigate using over 18,000 artifacts that we identify as associated with different countries. Our findings reveal a nuanced picture of the cultural competence of LMMs, highlighting the need to develop culture-aware systems. Dataset and code are available at https://github.com/iamshnoo/crossroads

Paper Structure (82 sections, 1 equation, 29 figures, 5 tables)

Introduction
Data
Dalle Street
Dollar Street (Rojas et al., 2022)
MaRVL (Liu et al., 2021)
Cultural Awareness (Task 1)
Methods
Evaluation Metrics
Economic disparities
Human Baseline
Results
Overall comparison
Subregion Level Analysis
Economic Disparity
Extracting Implicit Associations of Cultures and Artifacts (Task 2)
...and 67 more sections

Figures (29)

Figure 1: We introduce a large-scale dataset for measuring cultural awareness, an artifact extraction task for implicit cultural associations, and a modular pipeline for culturally adapting images with fine-grained edits.
Figure 2: LLaVA matches or outperforms GPT-4V on two of three datasets. Human accuracy on a Dalle Street subset is 47.63%.
Figure 3: Confusion matrices for GPT-4V on the cultural awareness task for Dalle Street images. Accurate responses match the true subregion. Special labels include Invalid (no match or incomplete) and ResponsibleAI (policy violation). Takeaway: The model performs well, with a strong leading diagonal and 100% accuracy for Western Asia (which covers Iran, Jordan, Lebanon, Oman, Palestine, Turkey).
Figure 4: We normalize Dollar Street income data into region-specific quartiles and plot accuracies for GPT-4V. Takeaway: Lower income quartiles (Q1, Q2) show higher accuracy in Africa and Asia, while higher quartiles (Q3, Q4) perform better in the Americas. In Europe, accuracy is similar across all quartiles.
Figure 5: We score each artifact based on its likelihood of co-occurrence for a country. Scores outside the mean and standard deviation range (red) indicate frequent co-occurrences, representing implicit (potentially stereotypical) associations.
...and 24 more figures
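
The region-specific quartile normalization described in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' code: the record fields (`region`, `income`, `correct`), both function names, and the choice of `statistics.quantiles` with its default cut-point method are assumptions made for illustration only. The idea is that each income is ranked against other households in the same region, so Q1 in one region and Q1 in another may cover very different absolute incomes.

```python
from bisect import bisect_left
from collections import defaultdict
from statistics import quantiles


def assign_region_quartiles(records):
    """Return copies of the records with a 'quartile' key (1 to 4).

    Quartile boundaries are computed separately for each region.
    The field names 'region' and 'income' are hypothetical.
    Each region needs at least two records.
    """
    incomes_by_region = defaultdict(list)
    for rec in records:
        incomes_by_region[rec["region"]].append(rec["income"])

    # Three cut points per region split its incomes into four bins.
    cuts = {
        region: quantiles(values, n=4)
        for region, values in incomes_by_region.items()
    }

    labelled = []
    for rec in records:
        # Count how many cut points lie strictly below this income.
        quartile = bisect_left(cuts[rec["region"]], rec["income"]) + 1
        labelled.append({**rec, "quartile": quartile})
    return labelled


def accuracy_by_region_quartile(labelled):
    """Map (region, quartile) to the fraction of records marked correct."""
    counts = defaultdict(lambda: [0, 0])  # [num_correct, num_total]
    for rec in labelled:
        key = (rec["region"], rec["quartile"])
        counts[key][0] += int(rec["correct"])
        counts[key][1] += 1
    return {key: c / n for key, (c, n) in counts.items()}
```

Under these assumptions, the highest-income household in a low-income region falls in that region's Q4 even if its absolute income is below the Q1 boundary of a wealthier region, which is the property that makes per-region accuracy comparisons across quartiles meaningful.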
