MELA: Multilingual Evaluation of Linguistic Acceptability

Ziyin Zhang, Yikang Liu, Weifang Huang, Junyu Mao, Rui Wang, Hai Hu

TL;DR

MELA is the largest multilingual benchmark for linguistic acceptability to date, with 46K labeled sentences across 10 languages, and it supports cross-lingual analysis and syntax probing. The authors benchmark a range of LLMs alongside fine-tuned XLM-R and find that GPT-4o has the strongest multilingual performance, while in-language prompting significantly boosts few-shot results. Cross-lingual transfer in acceptability judgment is non-trivial (fine-tuning on 500 Icelandic examples yields an MCC of 23 on Chinese, an unrelated language), and data-size effects are nuanced. Edge probing further shows that fine-tuning XLM-R on MELA improves its performance on syntax-related tasks, suggesting that acceptability training fosters syntactic knowledge. The dataset fills a gap in multilingual linguistic evaluation and provides a resource for further cross-lingual, syntactic, and interpretability research, with data available at https://github.com/sjtu-compling/MELA.

Abstract

In this work, we present the largest benchmark to date on linguistic acceptability: Multilingual Evaluation of Linguistic Acceptability (MELA), with 46K samples covering 10 languages from a diverse set of language families. We establish LLM baselines on this benchmark, and investigate cross-lingual transfer in acceptability judgment with XLM-R. In pursuit of multilingual interpretability, we conduct probing experiments with fine-tuned XLM-R to explore the process of syntax capability acquisition. Our results show that GPT-4o exhibits a strong multilingual ability, outperforming fine-tuned XLM-R, while open-source multilingual models lag behind by a noticeable gap. Cross-lingual transfer experiments show that transfer in acceptability judgment is non-trivial: 500 Icelandic fine-tuning examples yield an MCC of 23 in a completely unrelated language, Chinese. Results of our probing experiments indicate that training on MELA improves the performance of XLM-R on syntax-related tasks. Our data is available at https://github.com/sjtu-compling/MELA.

Paper Structure (43 sections, 5 figures, 9 tables)

Figures (5)

  • Figure 1: Performance of XLM-R when fine-tuned on different languages. The horizontal axis indicates the number of training samples. For example, for "all" curves, the point at 500 indicates the model is trained on 500 sentences, with 50 from each language. For "All-but-in-lang." curves, the point at 495 indicates the model is trained on 495 sentences, with 55 from each of the nine languages except the one being evaluated on.
  • Figure 2: Prompt selection results. We experiment with 4 prompts adapted from existing CoLA prompts in promptsource and lm-evaluation-harness.
  • Figure 3: Average performance across languages with different numbers of in-context examples. We average the MCC and report standard deviations over 5 seeds. Gray bands denote standard deviations.
  • Figure 4: Prompt used for evaluating LLMs.
  • Figure 5: Inter-run variance when fine-tuning XLM-R on English (first row) and Chinese (second row) training data. Each subfigure plots the validation MCC of seven runs with different random seeds on one language. Taking the median of these seven runs largely mitigates this variance.
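
The captions above rely on three small computations: the Matthews correlation coefficient (MCC) used as the metric, the equal per-language split of a training budget (Figure 1), and the aggregation of scores over random seeds (Figures 3 and 5). The following is a minimal sketch of these in plain Python, not the authors' evaluation code. The function names `mcc`, `per_language_quota`, and `aggregate_runs` are hypothetical, the label convention (1 = acceptable, 0 = unacceptable) is an assumption, and the use of the sample standard deviation is also an assumption, since the captions do not say which estimator is reported. Scores such as "23 MCC" appear to be the coefficient multiplied by 100; the sketch returns the raw value in [-1, 1].

```python
import math
import statistics


def mcc(gold, pred):
    """Matthews correlation coefficient for binary labels.

    Assumed convention: 1 = acceptable, 0 = unacceptable.
    Returns a value in [-1, 1]; 0.0 when the denominator is zero
    (for example, when every prediction is the same class).
    """
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def per_language_quota(total, n_languages):
    """Equal split of a training budget across languages (as in Figure 1)."""
    if total % n_languages != 0:
        raise ValueError("total must be divisible by the number of languages")
    return total // n_languages


def aggregate_runs(scores):
    """Summarize one score per random seed.

    The median matches the aggregation described for Figure 5; the mean and
    sample standard deviation are one possible reading of Figure 3.
    """
    return {
        "median": statistics.median(scores),
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),
    }
```

For instance, `per_language_quota(500, 10)` gives 50 sentences per language for the "all" setting, and `per_language_quota(495, 9)` gives 55 per language when the evaluated language is held out, which matches the two data points described in the Figure 1 caption.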