Cross-Lingual and Cross-Cultural Variation in Image Descriptions

Uri Berger; Edoardo M. Ponti

TL;DR

This work presents the first large-scale empirical study of cross-lingual variation in image descriptions. Using a multimodal dataset with 31 languages and images from diverse locations, it reveals both universal and culture-specific patterns in entity mentions.

Abstract

Do speakers of different languages talk differently about what they see? Behavioural and cognitive studies report cultural effects on perception; however, these are mostly limited in scope and hard to replicate. In this work, we conduct the first large-scale empirical study of cross-lingual variation in image descriptions. Using a multimodal dataset with 31 languages and images from diverse locations, we develop a method to accurately identify entities mentioned in captions and present in the images, then measure how they vary across languages. Our analysis reveals that pairs of languages that are geographically or genetically closer tend to mention the same entities more frequently. We also identify entity categories whose saliency is universally high (such as animate beings), universally low (clothing accessories), or highly variable across languages (landscape). In a case study, we measure the differences in a specific language pair (e.g., Japanese captions mention clothing far more frequently than English captions). Furthermore, our method corroborates previous small-scale studies, including 1) the theory of basic-level categories of Rosch et al. (1976), demonstrating a preference for entities that are neither too generic nor too specific, and 2) the hypothesis of Miyamoto et al. (2006) that environments afford patterns of perception, such as entity counts. Overall, our work reveals the presence of both universal and culture-specific patterns in entity mentions.

Paper Structure (39 sections, 2 equations, 4 figures, 2 tables)

Introduction
Background and Related Work
Small-Scale Controlled Studies
Large-Scale Studies
WordNet
Methods
Captions Translation
Synset Selection
Synset Extraction
Noun phrase extraction.
Synset identification.
Resolving ambiguities in synset mapping.
Synset Filtering
Validation
Experiments
...and 24 more sections

Figures (4)

Figure 1: A photo taken in an Indonesian-speaking area, corresponding captions in Indonesian (id) and Dutch (nl) translated to English, and saliency for the person.n.01 and dancer.n.01 synsets. Saliency is measured as the proportion of captions referring to the synset or any of its descendants (e.g., dancer.n.01 is a descendant of person.n.01).
Figure 2: Violin plot of the distribution of saliency scores across languages for each entity category. Saliency scores are averaged across all images containing the entity. Silver lines indicate quartiles.
Figure 3: Distribution of depths in the WordNet synset hierarchy across languages.
Figure 4: Average number of entities mentioned by speakers of 31 languages in captions of images captured in different locations. The dashed line indicates identity.
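
The saliency measure described in the Figure 1 caption (the proportion of captions referring to a synset or any of its descendants) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy hypernym table `PARENT`, the helper names, and the example captions are all hypothetical, and each caption is assumed to have already been mapped to a set of synset names.

```python
from typing import Dict, List, Optional, Set

# Toy hypernym links (child -> parent). A hypothetical stand-in for the
# WordNet hierarchy; the real hierarchy is far larger.
PARENT: Dict[str, Optional[str]] = {
    "entity.n.01": None,
    "person.n.01": "entity.n.01",
    "dancer.n.01": "person.n.01",
    "clothing.n.01": "entity.n.01",
}


def is_descendant_or_self(synset: str, ancestor: str) -> bool:
    """Walk up the hypernym chain from `synset` looking for `ancestor`."""
    node: Optional[str] = synset
    while node is not None:
        if node == ancestor:
            return True
        node = PARENT.get(node)
    return False


def saliency(target: str, captions: List[Set[str]]) -> float:
    """Proportion of captions mentioning `target` or one of its descendants."""
    if not captions:
        return 0.0
    hits = sum(
        any(is_descendant_or_self(s, target) for s in caption)
        for caption in captions
    )
    return hits / len(captions)


# Hypothetical captions for one image, already reduced to synset sets.
captions = [
    {"dancer.n.01"},
    {"person.n.01", "clothing.n.01"},
    {"clothing.n.01"},
    {"dancer.n.01", "clothing.n.01"},
]

print(saliency("person.n.01", captions))  # 0.75
print(saliency("dancer.n.01", captions))  # 0.5
```

In this toy example a caption mentioning dancer.n.01 also counts towards person.n.01, so the more general synset can never have lower saliency than its descendant.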
