Since the Scientific Literature Is Multilingual, Our Models Should Be Too

Abteen Ebrahimi, Kenneth Church

TL;DR

This work shows that the scientific literature is markedly multilingual: English accounts for about $85.11\%$ of abstracts, while 50 other languages contribute substantially to the corpus. It demonstrates that English-only models produce poor representations for non-English papers, as evidenced by high UNK-token rates (e.g., $>60\%$ for several scripts) and by pseudo-perplexity that rises on unseen languages, in contrast to multilingual baselines such as XLM-R. The authors argue that current benchmarks and models do not adequately reflect this linguistic diversity and propose four concrete directions to improve performance on non-English documents: language detection and translation, text-agnostic methods, multilingual training, and multilingual evaluation datasets. They illustrate real-world risks in user-facing outputs (e.g., TL;DR summaries) for non-Latin languages and emphasize the need for multilingual benchmarks to drive progress in scientific document representation. Overall, the paper advocates a shift from English-centric NLP to robust multilingual modeling to ensure equitable and accurate processing of scientific content worldwide.

Abstract

English has long been assumed to be the $\textit{lingua franca}$ of scientific research, and this notion is reflected in natural language processing (NLP) research on scientific document representation. In this position piece, we quantitatively show that the literature is largely multilingual and argue that current models and benchmarks should reflect this linguistic diversity. We provide evidence that text-based models fail to create meaningful representations for non-English papers and highlight the negative user-facing impacts of using English-only models indiscriminately across a multilingual domain. We end with suggestions for the NLP community on how to improve performance on non-English documents.

Paper Structure (17 sections, 1 equation, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Main results. The black dotted line shows the total number of abstracts for each language. Bars show the average proportion of unknown tokens for that language under a monolingual vs. a multilingual model.
  • Figure 2: Pseudo-Perplexity (PPPL) per subword for languages with low unknown token counts.
  • Figure 3: Relatively small cosines for non-English papers suggest opportunities for improvement.
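
The unknown-token proportion plotted in Figure 1 can be illustrated with a minimal sketch. It assumes a WordPiece-style tokenizer that segments each whitespace-separated word greedily (longest match first) and maps the whole word to `[UNK]` when no segmentation exists. The paper's actual tokenizers, vocabularies, and averaging scheme are not given in this section, so `TOY_VOCAB`, `wordpiece_tokenize`, and `unk_rate` are hypothetical names used only for illustration.

```python
from typing import List, Set

UNK = "[UNK]"

# Toy vocabulary (an assumption for illustration, not a real model's vocabulary).
TOY_VOCAB: Set[str] = {"multi", "##lingual", "models", "science"}


def wordpiece_tokenize(word: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match-first segmentation of a single word.

    Non-initial pieces carry the "##" continuation prefix. If any part of
    the word cannot be matched, the whole word becomes a single [UNK].
    """
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [UNK]
        pieces.append(match)
        start = end
    return pieces


def unk_rate(text: str, vocab: Set[str]) -> float:
    """Fraction of produced tokens that are [UNK]; 0.0 for empty input."""
    tokens = [p for w in text.split() for p in wordpiece_tokenize(w, vocab)]
    if not tokens:
        return 0.0
    return sum(t == UNK for t in tokens) / len(tokens)


if __name__ == "__main__":
    print(unk_rate("multilingual models", TOY_VOCAB))
    print(unk_rate("многоязычные models", TOY_VOCAB))
```

With a vocabulary built only from Latin-script text, a word in another script has no matching pieces and collapses to `[UNK]`, which is the failure mode the figure quantifies for the monolingual model.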