XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages

Sebastian Ruder; Jonathan H. Clark; Alexander Gutkin; Mihir Kale; Min Ma; Massimo Nicosia; Shruti Rijhwani; Parker Riley; Jean-Michel A. Sarr; Xinyi Wang; John Wieting; Nitish Gupta; Anna Katanova; Christo Kirov; Dana L. Dickinson; Brian Roark; Bidisha Samanta; Connie Tao; David I. Adelani; Vera Axelrod; Isaac Caswell; Colin Cherry; Dan Garrette; Reeve Ingle; Melvin Johnson; Dmitry Panteleev; Partha Talukdar


TL;DR

XTREME-UP presents a scarce-data, user-centric benchmark for 88 under-represented languages (ULs) across 9 tasks, emphasizing realistic annotation budgets and multi-modal evaluation. It standardizes in-language fine-tuning with an 8-hour per-UL limit, provides new data for OCR, autocomplete, transliteration, and semantic parsing, builds robust QA, retrieval, and NER benchmarks, and assesses both text-only and multi-modal inputs. Byte-based models like ByT5 outperform subword models on ULs, while in-context learning often falls short under limited data, revealing substantial headroom for UL improvements. The work offers practical guidelines on data splits, pre-training data management, and avoiding opaque MT APIs, aiming to accelerate usable NLP for UL communities.

Abstract

Data scarcity is a crucial issue for the development of highly multilingual NLP systems. Yet for many under-represented languages (ULs) -- languages for which NLP research is particularly far behind in meeting user needs -- it is feasible to annotate small amounts of data. Motivated by this, we propose XTREME-UP, a benchmark defined by: its focus on the scarce-data scenario rather than zero-shot; its focus on user-centric tasks -- tasks with broad adoption by speakers of high-resource languages; and its focus on under-represented languages where this scarce-data scenario tends to be most realistic. XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies including ASR, OCR, MT, and information access tasks that are of general utility. We create new datasets for OCR, autocomplete, semantic parsing, and transliteration, and build on and refine existing datasets for other tasks. XTREME-UP provides methodology for evaluating many modeling scenarios including text-only, multi-modal (vision, audio, and text), supervised parameter tuning, and in-context learning. We evaluate commonly used models on the benchmark. We release all code and scripts to train and evaluate models.


Paper Structure (114 sections, 2 figures, 12 tables)

This paper contains 114 sections, 2 figures, 12 tables.

Introduction
Related Work
Multilingual benchmarks
Multilingual evaluation
XTREME-UP
Design Principles
Under-represented languages
User-centric tasks
Scarce data
Efficiency
Text-centric, yet multi-modal
How much data?
Input / Output Tasks
Automatic speech recognition (ASR)
Optical character recognition (OCR)
...and 99 more sections

Figures (2)

Figure 1: The tasks in XTREME-UP and their role in language technology. Left: enabling access to language technology; middle: facilitating information access as part of larger systems (question answering, information extraction, virtual assistants); right: making information accessible in the speaker's language.
Figure 2: Creation of a linearized query from the actual query and its parse for semantic parsing.
