A Measure for Transparent Comparison of Linguistic Diversity in Multilingual NLP Data Sets

Tanja Samardzic, Ximena Gutierrez, Christian Bentz, Steven Moran, Olga Pelloni

TL;DR

The paper addresses the problem that multilingual NLP benchmarks usually infer diversity from the number of languages or language families, which fails to account for structural linguistic differences. It introduces $J_{mm}$, a minmax Jaccard-based metric that compares the diversity of a data set against a reference language sample. The metric combines grammar features from typological databases with a text-based proxy, mean word length, and uses TI measures for cross-validation. The findings show that many benchmarks are less linguistically diverse than their size suggests, especially with respect to morphologically rich languages, and that $J_{mm}$ makes transparent which language types are missing. This work offers an automatic, interpretable framework to guide the construction of more typologically representative multilingual benchmarks, reducing biases against low-resource languages.

Abstract

Typologically diverse benchmarks are increasingly created to track the progress achieved in multilingual NLP. Linguistic diversity of these data sets is typically measured as the number of languages or language families included in the sample, but such measures do not consider structural properties of the included languages. In this paper, we propose assessing linguistic diversity of a data set against a reference language sample as a means of maximising linguistic diversity in the long run. We represent languages as sets of features and apply a version of the Jaccard index suitable for comparing sets of measures. In addition to the features extracted from typological data bases, we propose an automatic text-based measure, which can be used as a means of overcoming the well-known problem of data sparsity in manually collected features. Our diversity score is interpretable in terms of linguistic features and can identify the types of languages that are not represented in a data set. Using our method, we analyse a range of popular multilingual data sets (UD, Bible100, mBERT, XTREME, XGLUE, XNLI, XCOPA, TyDiQA, XQuAD). In addition to ranking these data sets, we find, for example, that (poly)synthetic languages are missing in almost all of them.
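
As a rough illustration of the idea of a Jaccard index over sets of measures, the sketch below compares a data set with a reference sample, each described by the number of languages falling into a feature bin. It assumes the common weighted form (sum of bin-wise minima divided by sum of bin-wise maxima); the paper's exact definition, binning, and normalisation may differ. The function name `minmax_jaccard`, the bin labels, and the toy counts are hypothetical and not taken from the paper.

```python
from collections import Counter


def minmax_jaccard(a, b):
    """Weighted Jaccard: sum of per-bin minima over sum of per-bin maxima."""
    keys = set(a) | set(b)
    union = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    if union == 0:
        return 1.0  # assumption: two empty samples count as identical
    intersection = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    return intersection / union


# Hypothetical toy data: number of languages per mean-word-length bin.
reference = Counter({"short": 2, "medium": 3, "long": 3})
dataset = Counter({"short": 3, "medium": 3, "long": 0})

score = minmax_jaccard(dataset, reference)

# Bins where the data set has fewer languages than the reference.
missing = {
    k: reference[k] - dataset.get(k, 0)
    for k in reference
    if reference[k] > dataset.get(k, 0)
}
```

In this toy case the intersection is 2 + 3 + 0 = 5 and the union is 3 + 3 + 3 = 9, so the score is 5/9. The `missing` dictionary points to the "long" bin, which is the kind of interpretable gap the abstract describes: the score can be traced back to the language types that are underrepresented.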

Paper Structure

This paper contains 16 sections, 2 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Languages in the WALS 100L sample with their endangerment status.
  • Figure 2: A toy example of comparing sets of measures with the minmax version of the Jaccard index.
  • Figure 3: Union and intersection between the distributions of the mean word length in TeDDi and NLP data sets.
  • Figure 4: Mean word length measures at different text sizes in TeDDi. The languages on the x-axis are sorted according to the increasing value calculated on the biggest sample (10K). The values in the two smaller samples (2K and 500) depart very little from the main trend.
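
The text-based measure behind Figures 3 and 4 is mean word length. The following is a minimal sketch, assuming whitespace tokenisation and length counted in characters; the paper's actual tokenisation and the TeDDi preprocessing are not stated here and may differ. The function `mean_word_length`, its `max_tokens` parameter, and the sample text are hypothetical.

```python
def mean_word_length(text, max_tokens=None):
    """Mean number of characters per whitespace-separated token.

    If max_tokens is given, only the first max_tokens tokens are used,
    which mimics measuring on a smaller text sample.
    """
    tokens = text.split()
    if max_tokens is not None:
        tokens = tokens[:max_tokens]
    if not tokens:
        raise ValueError("no tokens in text")
    return sum(len(t) for t in tokens) / len(tokens)


# Hypothetical stand-in for a corpus sample (600 tokens).
sample = "the cat sat on the mat " * 100

full = mean_word_length(sample)
small = mean_word_length(sample, max_tokens=60)
```

Comparing `full` with `small` corresponds to the check described in the Figure 4 caption, where the measure is computed on text samples of different sizes to see how far the smaller samples depart from the largest one.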