Controlled Evaluation of Syntactic Knowledge in Multilingual Language Models

Daria Kryvosheieva, Roger Levy

TL;DR

The paper investigates how well multilingual Transformer language models form syntactic generalizations in three low-resource languages (Basque, Hindi, Swahili), using targeted syntactic evaluation with synthetic minimal pairs and human validation. It tests five families of open-access LMs and finds language- and phenomenon-specific patterns: Hindi split ergativity is largely captured, Basque auxiliary agreement is generally learned but degrades when the sentence contains an indirect object, and Swahili noun class agreement remains the most challenging. The study also documents a bias toward the habitual aspect in Hindi in mBERT and underperformance of XGLM-4.5B relative to similar-sized models, likely due to upsampling gaps and the curse of multilinguality, while showing that larger models often perform better. Overall, the results highlight both the promise and the limitations of current multilingual LMs for syntactic generalization in low-resource languages and suggest directions for improving data and architectures to support broader linguistic coverage.

Abstract

Language models (LMs) are capable of acquiring elements of human-like syntactic knowledge. Targeted syntactic evaluation tests have been employed to measure how well they form generalizations about syntactic phenomena in high-resource languages such as English. However, we still lack a thorough understanding of LMs' capacity for syntactic generalizations in low-resource languages, which are responsible for much of the diversity of syntactic patterns worldwide. In this study, we develop targeted syntactic evaluation tests for three low-resource languages (Basque, Hindi, and Swahili) and use them to evaluate five families of open-access multilingual Transformer LMs. We find that some syntactic tasks prove relatively easy for LMs while others (agreement in sentences containing indirect objects in Basque, agreement across a prepositional phrase in Swahili) are challenging. We additionally uncover issues with publicly available Transformers, including a bias toward the habitual aspect in Hindi in multilingual BERT and underperformance compared to similar-sized models in XGLM-4.5B.

Paper Structure

This paper contains 24 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Accuracy scores of the models (vertical axis) on our test suites (horizontal axis). In each cell, the bolded value denotes the fraction of minimal pairs in which the model selected the grammatical target, while the values in parentheses denote the lower and upper bounds of the 95% confidence interval. The expectation for random guessing is 0.5.
  • Figure 2: Accuracy as a function of parameter count for each model family and test suite.
  • Figure 3: Accuracy as a function of the complexity of the intervening constituent for Hindi and Swahili test suites. For models available in multiple versions, we show mean accuracy over versions; error bars denote 95% confidence intervals on the standard error of the mean.
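
The accuracy metric described in the Figure 1 caption can be illustrated with a short sketch. It assumes each minimal pair has already been scored as a pair of log-probabilities (grammatical, ungrammatical), and it uses a Wilson score interval for the 95% confidence interval. Both the scoring format and the choice of interval are assumptions made for illustration; the captions do not state how the paper computes its intervals, and the function names and toy scores below are hypothetical.

```python
import math


def minimal_pair_accuracy(pairs):
    """Count pairs where the grammatical sentence gets the higher score.

    `pairs` is a list of (grammatical_logprob, ungrammatical_logprob)
    tuples. This input format is an assumption for illustration.
    """
    correct = sum(1 for good, bad in pairs if good > bad)
    return correct, len(pairs)


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%).

    Assumed here as one common choice; the paper may use another method.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Toy log-probabilities, invented for this example (not results from the paper).
toy_pairs = [(-10.2, -11.0), (-8.5, -8.1), (-12.0, -13.4), (-9.9, -10.5)]

correct, total = minimal_pair_accuracy(toy_pairs)
accuracy = correct / total
low, high = wilson_interval(correct, total)
print(f"accuracy = {accuracy:.2f} ({low:.2f}, {high:.2f}); chance = 0.50")
```

With only a handful of pairs the interval is wide and can include the 0.5 chance level, which is why a test suite needs many minimal pairs before an accuracy score can be read as evidence of syntactic knowledge.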