PrOnto: Language Model Evaluations for 859 Languages
Luke Gessler
TL;DR
PrOnto tackles the scarcity of multilingual evaluation data by projecting OntoNotes' New Testament annotations into 859 languages via verse alignment, creating a scalable evaluation resource for pretrained language models. The approach centers on five annotation-projection tasks implemented as sequence classification, evaluated across a diverse set of languages and pretrained models using a standardized training setup. The results show that the projected tasks are meaningful proxies for model quality across languages with varying typological distance from English, and that the resource remains useful for high-, medium-, and low-resource settings. The work further provides a practical pipeline and encourages community contributions to extend the dataset and potentially derive typological distance insights from projection errors.
Abstract
Evaluation datasets are critical resources for measuring the quality of pretrained language models. However, due to the high cost of dataset annotation, these resources are scarce for most languages other than English, making it difficult to assess the quality of language models. In this work, we present a new method for evaluation dataset construction which enables any language with a New Testament translation to receive a suite of evaluation datasets suitable for pretrained language model evaluation. The method critically involves aligning verses with those in the New Testament portion of English OntoNotes, and then projecting annotations from English to the target language, with no manual annotation required. We apply this method to 1051 New Testament translations in 859 languages and make the resulting datasets publicly available. Additionally, we conduct experiments which demonstrate the efficacy of our method for creating evaluation tasks which can assess language model quality.
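The verse-alignment and projection step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `project_labels`, the `(book, chapter, verse)` key format, the label strings, and the placeholder verse texts are all hypothetical, and it assumes each annotation has already been reduced to a single verse-level label. The idea is only that a label derived from the English OntoNotes verse is copied to the target-language verse with the same identifier, and verses absent from the translation are skipped.

```python
from typing import Dict, List, Tuple

# Hypothetical verse key: (book, chapter, verse), e.g. ("JHN", 11, 35).
VerseId = Tuple[str, int, int]


def project_labels(
    english_labels: Dict[VerseId, str],
    target_verses: Dict[VerseId, str],
) -> List[Tuple[str, str]]:
    """Copy verse-level labels from English onto aligned target-language verses.

    Returns (target_text, label) pairs usable as sequence-classification
    examples. Verses missing from the target translation are dropped.
    """
    examples = []
    for verse_id, label in sorted(english_labels.items()):
        text = target_verses.get(verse_id)
        if text is None:
            continue  # no aligned verse in this translation
        examples.append((text, label))
    return examples


# Toy data: labels stand in for annotations derived from English OntoNotes;
# the target texts are placeholders, not real translations.
english_labels = {
    ("JHN", 11, 35): "declarative",
    ("MAT", 7, 7): "imperative",
    ("MAT", 7, 9): "interrogative",
}
target_verses = {
    ("JHN", 11, 35): "<target-language text of JHN 11:35>",
    ("MAT", 7, 7): "<target-language text of MAT 7:7>",
    # MAT 7:9 is absent from this hypothetical translation.
}

dataset = project_labels(english_labels, target_verses)
for text, label in dataset:
    print(label, text)
```

Because the alignment is by verse identifier alone, the sketch needs no word-level alignment or manual annotation in the target language, which is what lets the same procedure run over any translation with verse numbering.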
