XNLIeu: a dataset for cross-lingual NLI in Basque

Maite Heredia, Julen Etxaniz, Muitze Zulaika, Xabier Saralegi, Jeremy Barnes, Aitor Soroa

TL;DR

This work introduces XNLIeu, a Basque cross-lingual NLI dataset created by translating English XNLI into Basque and applying professional post-edition, complemented by a machine-translated variant and a native Basque test set. The authors evaluate multiple discriminative and generative models under zero-shot, translate-train, and prompting paradigms, revealing that post-edition improves data reliability and that translate-train typically yields the strongest cross-lingual transfer, though the advantage diminishes on native data. They provide a comprehensive analysis of biases, artifacts, and per-label performance, showing that translation-based datasets exhibit artifacts (e.g., negation cues) that native data mitigates. The native Basque test set demonstrates the importance of dataset origin on evaluation and supports Basque language resource development by offering publicly available benchmarks and baselines for future research.

Abstract

XNLI is a popular Natural Language Inference (NLI) benchmark widely used to evaluate cross-lingual Natural Language Understanding (NLU) capabilities across languages. In this paper, we expand XNLI to include Basque, a low-resource language that can greatly benefit from transfer-learning approaches. The new dataset, dubbed XNLIeu, has been developed by first machine-translating the English XNLI corpus into Basque, followed by a manual post-edition step. We have conducted a series of experiments using mono- and multilingual LLMs to assess a) the effect of professional post-edition on the MT system; b) the best cross-lingual strategy for NLI in Basque; and c) whether the choice of the best cross-lingual strategy is influenced by the fact that the dataset is built by translation. The results show that post-edition is necessary and that the translate-train cross-lingual strategy obtains better results overall, although the gain is lower when tested on a dataset that has been built natively from scratch. Our code and datasets are publicly available under open licenses.

Paper Structure (26 sections, 2 figures, 12 tables)

Figures (2)

  • Figure 1: Box plots of the lexical overlap between premises and hypotheses in the three datasets, calculated with cosine similarity.
  • Figure 2: Confusion matrices for XLM-RoBERTa large fine-tuned on Basque, our best model, tested on our three datasets. Best viewed in color.