No Filter: Cultural and Socioeconomic Diversity in Contrastive Vision-Language Models

Angéline Pouget, Lucas Beyer, Emanuele Bugliarello, Xiao Wang, Andreas Peter Steiner, Xiaohua Zhai, Ibrahim Alabdulmohsin

TL;DR

This work introduces geo-localization as a novel evaluation task for assessing cultural diversity in VLMs. It underscores the value of using diverse data to create more inclusive multimodal systems and lays the groundwork for developing VLMs that better represent global perspectives.

Abstract

We study cultural and socioeconomic diversity in contrastive vision-language models (VLMs). Using a broad range of benchmark datasets and evaluation metrics, we bring to attention several important findings. First, the common filtering of training data to English image-text pairs disadvantages communities of lower socioeconomic status and negatively impacts cultural understanding. Notably, this performance gap is not captured by, and is even at odds with, the currently popular evaluation metrics derived from the Western-centric ImageNet and COCO datasets. Second, pretraining with global, unfiltered data before fine-tuning on English content can improve cultural understanding without sacrificing performance on said popular benchmarks. Third, we introduce the task of geo-localization as a novel evaluation metric to assess cultural diversity in VLMs. Our work underscores the value of using diverse data to create more inclusive multimodal systems and lays the groundwork for developing VLMs that better represent global perspectives.

Paper Structure

This paper contains 20 sections, 1 equation, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Models trained on English image-text pairs exhibit a lack of diversity when evaluated on images from other regions, sometimes confusing landmarks with similar ones located in the West.
  • Figure 2: Data distribution [%] for each of the evaluation datasets. The distribution for MaRVL [marvl_dataset] is only approximate, based on the 5 languages collected in the dataset. Dollar Street [rojas2022dollar], GeoDE [ramaswamy2024geode], GLDv2 [weyand2020google] and XM3600 [etxaniz2023multilingual] are geographically diverse. MaRVL is included because it focuses on underrepresented regions, such as Asia and East Africa. By comparison, ImageNet examples are mostly from a few Western countries (see for instance [shankar2017no]). COCO has a nearly identical distribution to ImageNet [de2019does].
  • Figure 3: Filtering to English-only data further exacerbates existing performance disparities across socioeconomic subgroups. left: Zero-shot classification results for Dollar Street, disaggregated by income level ($x$-axis). The performance difference between en and globe-tl is larger for lower-income households. In addition, the gap between the lowest and highest income groups is 32.5 percentage points in en (29.9% in the \$0-200 income group versus 62.4% in the \$1998+ income group), and it narrows to 27.4 percentage points in globe-tl. right: MaRVL Concepts classification accuracy, disaggregated by each of the five languages/regions. Pretraining on globe-tl improves performance for Indonesian, Turkish and Mandarin Chinese and yields performance similar to en for Tamil and Swahili.
  • Figure 4: Fine-tuning globe-tl on en quickly catches up with en for ImageNet zero-shot evaluation while also performing better on GLDv2. Conversely, fine-tuning en on globe-tl does not suffice to close the gap in performance on culturally diverse benchmarks.
  • Figure 5: left: Fine-tuning allows for a controlled trade-off between cultural diversity and performance on standard benchmarks. Fine-tuning globe-tl on en is strictly better than fine-tuning en on globe-tl, but mixing training data in different proportions achieves a better trade-off overall. Values in percentages [%] correspond to the fraction of time training is restricted to en data. right: Correlation coefficients of the evaluation metrics computed based on over $40$ fully trained models.
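
The income-disaggregated comparison described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `accuracy_by_group` and `disparity_gap` and the record format are assumptions made for illustration. Only the two en accuracies (29.9% and 62.4%) come from the caption.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {group: accuracy in percent} from (group, is_correct) pairs.

    The (group, is_correct) record format is a hypothetical choice for
    this sketch, not a format taken from the paper.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        hits[group] += int(is_correct)
    return {g: 100.0 * hits[g] / totals[g] for g in totals}


def disparity_gap(acc, low_group, high_group):
    """Accuracy difference (percentage points) between two subgroups."""
    return acc[high_group] - acc[low_group]


# Values quoted in the Figure 3 caption for the en model.
en_acc = {"$0-200": 29.9, "$1998+": 62.4}
print(round(disparity_gap(en_acc, "$0-200", "$1998+"), 1))  # 32.5
```

Reporting the gap in percentage points, rather than only an aggregate accuracy, is what makes the disparity between income groups visible.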