From MTEB to MTOB: Retrieval-Augmented Classification for Descriptive Grammars
Albert Kornilov, Tatiana Shavrina
TL;DR
The paper addresses scaling NLP and machine translation to under-resourced languages by using Retrieval-Augmented Generation (RAG) to extract typological features from descriptive grammars. It introduces two benchmarks and an open-source RAG pipeline that evaluate how well large language models can read grammars and classify typological features drawn from WALS and Grambank. The key findings are that traditional BM25 retrieval is competitive with modern embedding-based retrieval in this domain and that end-to-end RAG pipelines can outperform baselines; linguistic descriptions nevertheless remain challenging for the models, and chain-of-thought prompting yields mixed benefits. Overall, the work provides a practical framework and data resources for typological feature extraction and MT for low-resource languages, and it highlights the need for standardized grammars and further methodological improvements.
Abstract
Recent advances in language modeling have demonstrated significant improvements in zero-shot capabilities, including in-context learning, instruction following, and machine translation for extremely under-resourced languages (Tanzer et al., 2024). However, many languages with limited written resources rely primarily on formal descriptions of grammar and vocabulary. In this paper, we introduce a set of benchmarks to evaluate how well models can extract and classify information from the complex descriptions found in linguistic grammars. We present a Retrieval-Augmented Generation (RAG)-based approach that leverages these descriptions for downstream tasks such as machine translation. Our benchmarks encompass linguistic descriptions for 248 languages across 142 language families, focusing on typological features from WALS and Grambank. This set of benchmarks offers the first comprehensive evaluation of language models' in-context ability to accurately interpret and extract linguistic features, providing a critical resource for scaling NLP to low-resource languages. The code and data are publicly available at https://github.com/al-the-eigenvalue/RAG-on-grammars.
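To make the retrieve-then-classify idea concrete, the following is a minimal sketch, not the authors' pipeline: it ranks grammar paragraphs against a typological question with a plain Okapi BM25 scorer and assembles the top paragraphs into a prompt. The function names (`tokenize`, `bm25_rank`, `build_prompt`), the parameter values `k1=1.5` and `b=0.75`, the prompt wording, and the toy paragraphs are all assumptions made for illustration; the actual implementation is in the repository linked above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on non-word characters (a simplifying assumption).
    return re.findall(r"\w+", text.lower())


def bm25_rank(query, paragraphs, k1=1.5, b=0.75):
    """Return paragraph indices sorted by descending Okapi BM25 score."""
    docs = [tokenize(p) for p in paragraphs]
    n = len(docs)
    if n == 0:
        return []
    avg_len = (sum(len(d) for d in docs) / n) or 1.0
    # Document frequency: number of paragraphs containing each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    # sorted() is stable, so ties keep their original document order.
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


def build_prompt(question, paragraphs, top_k=2):
    # Hypothetical prompt template: retrieved excerpts followed by the question.
    ranked = bm25_rank(question, paragraphs)[:top_k]
    context = "\n\n".join(paragraphs[i] for i in ranked)
    return f"Grammar excerpts:\n{context}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    # Toy paragraphs standing in for a descriptive grammar.
    grammar = [
        "The verb follows the object in main clauses.",
        "Nouns are marked for case with suffixes.",
        "Tone is lexically contrastive.",
    ]
    print(build_prompt("What is the order of object and verb?", grammar))
```

The prompt produced this way would then be passed to a language model, which is asked to return the value of the typological feature; that step is omitted here because it depends on the model and API in use.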
