What is "Typological Diversity" in NLP?

Esther Ploeger, Wessel Poelman, Miryam de Lhoneux, Johannes Bjerva

TL;DR

This meta-analysis systematically investigates NLP research that includes claims regarding typological diversity and shows that skewed language selection can lead to overestimated multilingual performance.

Abstract

The NLP research community has devoted increased attention to languages beyond English, resulting in considerable improvements for multilingual NLP. However, these improvements only apply to a small subset of the world's languages. Aiming to extend this, an increasing number of papers aspires to enhance generalizable multilingual performance across languages. To this end, linguistic typology is commonly used to motivate language selection, on the basis that a broad typological sample ought to imply generalization across a broad range of languages. These selections are often described as being 'typologically diverse'. In this work, we systematically investigate NLP research that includes claims regarding 'typological diversity'. We find there are no set definitions or criteria for such claims. We introduce metrics to approximate the diversity of language selection along several axes and find that the results vary considerably across papers. Crucially, we show that skewed language selection can lead to overestimated multilingual performance. We recommend future work to include an operationalization of 'typological diversity' that empirically justifies the diversity of language samples.

Paper Structure

This paper contains 26 sections, 2 equations, 9 figures, and 5 tables.

Figures (9)

  • Figure 1: There is an increase in the number of publications with 'typological diversity' claims across time.
  • Figure 2: Number of papers with a claim by venue.
  • Figure 3: Papers by number of used languages.
  • Figure 4: Map of languages in all papers claiming 'typological diversity', where the hue corresponds to the number of papers that use a language. Coordinates are taken from WALS.
  • Figure 5: Distributions of mean pairwise lang2vec distances and feature inclusion per paper. On the left are approximations based on common justifications for claiming 'typological diversity': geography ($\mu=0.28$, $\sigma=0.11$) and genealogy ($\mu=0.94$, $\sigma=0.05$). On the right are two different approximations based on typological features: MPSD ($\mu=0.64$, $\sigma=0.07$) and Grambank feature value inclusion ($\mu=0.72$, $\sigma=0.17$).
  • ...and 4 more figures
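
The Figure 5 caption refers to two kinds of per-paper measurements: a mean pairwise distance over the languages in a sample, and the share of feature values a sample covers. The sketch below illustrates the general idea of both on made-up data. It is not the authors' implementation: the toy feature table, the function names, and the choice of normalized Hamming distance are all assumptions for illustration, whereas the paper relies on lang2vec distances and Grambank features.

```python
from itertools import combinations

# Hypothetical toy data: each language maps to a tuple of binary feature values.
# The names and values are invented and do not come from lang2vec or Grambank.
TOY_FEATURES = {
    "lang_a": (0, 0, 1, 1),
    "lang_b": (0, 1, 1, 0),
    "lang_c": (1, 1, 0, 0),
}


def hamming_distance(u, v):
    """Fraction of positions at which two equal-length vectors differ."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return sum(a != b for a, b in zip(u, v)) / len(u)


def mean_pairwise_distance(sample, features):
    """Average distance over all unordered language pairs in the sample."""
    pairs = list(combinations(sample, 2))
    if not pairs:
        raise ValueError("need at least two languages")
    total = sum(hamming_distance(features[a], features[b]) for a, b in pairs)
    return total / len(pairs)


def feature_value_inclusion(sample, features, values=(0, 1)):
    """Share of all (feature, value) combinations attested in the sample."""
    n_features = len(next(iter(features.values())))
    seen = {(i, features[lang][i]) for lang in sample for i in range(n_features)}
    return len(seen) / (n_features * len(values))


if __name__ == "__main__":
    full_sample = ["lang_a", "lang_b", "lang_c"]
    print(mean_pairwise_distance(full_sample, TOY_FEATURES))
    print(feature_value_inclusion(["lang_a", "lang_b"], TOY_FEATURES))
```

Under these assumptions, a sample whose languages differ on many features scores a higher mean pairwise distance, and a sample that attests more of the possible feature values scores a higher inclusion. The two can diverge, which is one reason to report more than a single number when describing a language sample.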