Sequence-to-Sequence Spanish Pre-trained Language Models
Vladimir Araujo, Maria Mihaela Trusca, Rodrigo Tufiño, Marie-Francine Moens
TL;DR
This work addresses the scarcity of encoder–decoder sequence-to-sequence models for Spanish by pre-training and evaluating Spanish versions of BART (BARTO), T5 (T5S), and BERT2BERT-style architectures on extensive Spanish corpora. The authors systematically pre-train on OSCAR, mC4-es, and SUC data with document-level processing and robust data cleansing, then fine-tune on a broad suite of generative and discriminative tasks, including long-form summarization, split-and-rephrase, generative QA, dialogue, machine translation, and GLUES/SQAC benchmarks. Empirical results show that BARTO and T5S consistently top performance across most generative tasks, with T5S excelling in long-form and QA scenarios and BARTO delivering strong translations and dialogue. The work also provides a publicly available suite of models and establishes a foundation for future expansions towards larger Spanish seq2seq models and more diverse, domain-specific datasets, enabling broader Spanish NLP applications and better cross-lingual comparisons.
Abstract
In recent years, significant advancements in pre-trained language models have driven the creation of numerous non-English language variants, with a particular emphasis on encoder-only and decoder-only architectures. While Spanish language models based on BERT and GPT have demonstrated proficiency in natural language understanding and generation, there remains a noticeable scarcity of encoder-decoder models explicitly designed for sequence-to-sequence tasks, which aim to map input sequences to generate output sequences conditionally. This paper breaks new ground by introducing the implementation and evaluation of renowned encoder-decoder architectures exclusively pre-trained on Spanish corpora. Specifically, we present Spanish versions of BART, T5, and BERT2BERT-style models and subject them to a comprehensive assessment across various sequence-to-sequence tasks, including summarization, question answering, split-and-rephrase, dialogue, and translation. Our findings underscore the competitive performance of all models, with the BART- and T5-based models emerging as top performers across all tasks. We have made all models publicly available to the research community to foster future explorations and advancements in Spanish NLP: https://github.com/vgaraujov/Seq2Seq-Spanish-PLMs.
