No Culture Left Behind: ArtELingo-28, a Benchmark of WikiArt with Captions in 28 Languages

Youssef Mohamed, Runjia Li, Ibrahim Said Ahmad, Kilichbek Haydarov, Philip Torr, Kenneth Ward Church, Mohamed Elhoseiny

TL;DR

This paper presents ArtELingo-28, a vision-language benchmark that spans 28 languages and encompasses approximately 200,000 annotations (140 annotations per image), and finds that cross-lingual transfer is more successful for culturally-related languages.

Abstract

Research in vision and language has made considerable progress thanks to benchmarks such as COCO. COCO captions focused on unambiguous facts in English; ArtEmis introduced subjective emotions, and ArtELingo introduced some multilinguality (Chinese and Arabic). However, we believe there should be more multilinguality. Hence, we present ArtELingo-28, a vision-language benchmark that spans 28 languages and encompasses approximately 200,000 annotations (140 annotations per image). Traditionally, vision research focused on unambiguous class labels, whereas ArtELingo-28 emphasizes diversity of opinions over languages and cultures. The challenge is to build machine learning systems that assign emotional captions to images. Baseline results will be presented for three novel conditions: Zero-Shot, Few-Shot and One-vs-All Zero-Shot. We find that cross-lingual transfer is more successful for culturally-related languages. Data and code are provided at www.artelingo.org.

Paper Structure

This paper contains 31 sections, 10 figures, 11 tables.

Figures (10)

  • Figure 1: Number of Annotations per Language
  • Figure 2: Number of Annotators per Language
  • Figure 3: Pairwise Kullback-Leibler divergence between the emotion distributions of the languages. The lighter the color, the more the two languages agree on emotions.
  • Figure 4: Example Zero-Shot Generations. The top row is the performance on the test data from ArtELingo where the model has seen the languages during training. The second row corresponds to languages that the model has not seen during multimodal training.
  • Figure 5: One-vs-All Zero-Shot. The figure shows the ROUGE score on the target languages. On the left, clustering the captioning scores reveals groups that align with real-world cultural connections. This clustering suggests that our trained models can capture the cultural signal.
  • ...and 5 more figures
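
The Figure 3 caption describes a pairwise Kullback-Leibler divergence between the emotion distributions of different languages. The following is only a minimal sketch of how such a matrix could be computed, not the authors' code: the language names, the four emotion bins, the counts, and the additive smoothing constant `eps` are all hypothetical and chosen for illustration.

```python
import math


def normalize(counts, eps=1e-9):
    """Turn raw emotion counts into a probability distribution.

    The small additive constant eps is an assumption made here so that
    empty bins do not produce log(0); the paper may handle this differently.
    """
    total = sum(counts) + eps * len(counts)
    return [(c + eps) / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same emotion labels."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def pairwise_kl(counts_by_language):
    """Return {(lang_a, lang_b): KL(lang_a || lang_b)} for every ordered pair."""
    dists = {lang: normalize(c) for lang, c in counts_by_language.items()}
    return {(a, b): kl_divergence(dists[a], dists[b]) for a in dists for b in dists}


# Hypothetical counts over four made-up emotion bins (not taken from the dataset).
toy_counts = {
    "lang_a": [40, 30, 20, 10],
    "lang_b": [38, 32, 18, 12],
    "lang_c": [10, 20, 30, 40],
}

kl = pairwise_kl(toy_counts)
for (a, b), value in sorted(kl.items()):
    print(f"KL({a} || {b}) = {value:.4f}")
```

In this toy setup a lower value means the two languages label emotions more similarly, which corresponds to the lighter cells in the figure. KL divergence is not symmetric, so the matrix holds a separate value for each ordered pair.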