XNLIeu: a dataset for cross-lingual NLI in Basque
Maite Heredia, Julen Etxaniz, Muitze Zulaika, Xabier Saralegi, Jeremy Barnes, Aitor Soroa
TL;DR
This work introduces XNLIeu, a Basque cross-lingual NLI dataset created by translating English XNLI into Basque and applying professional post-edition, complemented by a machine-translated variant and a native Basque test set. The authors evaluate multiple discriminative and generative models under zero-shot, translate-train, and prompting paradigms, revealing that post-edition improves data reliability and that translate-train typically yields the strongest cross-lingual transfer, though the advantage diminishes on native data. They provide a comprehensive analysis of biases, artifacts, and per-label performance, showing that translation-based datasets exhibit artifacts (e.g., negation cues) that native data mitigates. The native Basque test set demonstrates how much dataset origin matters for evaluation and supports Basque language resource development by offering publicly available benchmarks and baselines for future research.
Abstract
XNLI is a popular Natural Language Inference (NLI) benchmark widely used to evaluate cross-lingual Natural Language Understanding (NLU) capabilities across languages. In this paper, we expand XNLI to include Basque, a low-resource language that can greatly benefit from transfer-learning approaches. The new dataset, dubbed XNLIeu, has been developed by first machine-translating the English XNLI corpus into Basque, followed by a manual post-edition step. We have conducted a series of experiments using mono- and multilingual LLMs to assess a) the effect of professional post-edition on the MT system; b) the best cross-lingual strategy for NLI in Basque; and c) whether the choice of the best cross-lingual strategy is influenced by the fact that the dataset is built by translation. The results show that post-edition is necessary and that the translate-train cross-lingual strategy obtains better results overall, although the gain is lower when tested on a dataset that has been built natively from scratch. Our code and datasets are publicly available under open licenses.
