Cross-Lingual Transfer for Low-Resource Natural Language Processing
Iker García-Ferrero
TL;DR
The thesis addresses the persistent data and resource gap in NLP for low-resource languages by advancing cross-lingual transfer for sequence labeling tasks, notably Named Entity Recognition (NER), Opinion Target Extraction (OTE), and Argument Mining. For data-based transfer, it introduces T-Projection, a two-step annotation projection method that leverages text-to-text multilingual models and machine translation (MT). For model-based transfer, it introduces a constrained decoding algorithm that improves the zero-shot performance of text-to-text models. Both lines of work are complemented by Medical mT5, the first open-source multilingual text-to-text medical model. The work reports state-of-the-art results across multiple languages and tasks, including robust zero-shot gains and strong extrinsic improvements when training data are generated via projection, especially in low-resource settings such as African languages. It also provides publicly available resources (code, datasets, models) and a comprehensive evaluation framework that highlights both the potential and the challenges of cross-lingual transfer, with clear implications for deploying NLP tools in under-resourced languages and in domains such as biomedicine.
Abstract
Natural Language Processing (NLP) has seen remarkable advances in recent years, particularly with the emergence of Large Language Models that have achieved unprecedented performance across many tasks. However, these developments have mainly benefited a small number of high-resource languages such as English. The majority of languages still face significant challenges due to the scarcity of training data and computational resources. To address this issue, this thesis focuses on cross-lingual transfer learning, a research area aimed at leveraging data and models from high-resource languages to improve NLP performance for low-resource languages. Specifically, we focus on Sequence Labeling tasks such as Named Entity Recognition, Opinion Target Extraction, and Argument Mining. The research is structured around three main objectives: (1) advancing data-based cross-lingual transfer learning methods through improved translation and annotation projection techniques, (2) developing enhanced model-based transfer learning approaches utilizing state-of-the-art multilingual models, and (3) applying these methods to real-world problems while creating open-source resources that facilitate future research in low-resource NLP. For data-based transfer, we present T-Projection, a state-of-the-art annotation projection method that leverages text-to-text multilingual models and machine translation systems. T-Projection outperforms previous annotation projection methods by a wide margin. For model-based transfer, we introduce a constrained decoding algorithm that enhances cross-lingual Sequence Labeling in zero-shot settings using text-to-text models. Finally, we develop Medical mT5, the first multilingual text-to-text medical model, demonstrating the practical impact of our research on real-world applications.
