Cross-Lingual Transfer for Low-Resource Natural Language Processing

Iker García-Ferrero

TL;DR

The thesis addresses the persistent data and resource gap in NLP for low-resource languages by advancing cross-lingual transfer for sequence labeling tasks, notably Named Entity Recognition (NER), Opinion Target Extraction (OTE), and Argument Mining. For data-based transfer, it introduces T-Projection, a two-step annotation projection method that leverages text-to-text multilingual models and machine translation (MT). For model-based transfer, it introduces a constrained decoding algorithm that improves the zero-shot performance of text-to-text models. It also presents Medical mT5, the first open-source multilingual text-to-text medical model. The work reports state-of-the-art results across multiple languages and tasks, including robust zero-shot gains and strong extrinsic improvements when training data are generated via projection, especially for low-resource languages such as African languages. It provides publicly available resources (code, datasets, models) and a comprehensive evaluation framework that highlights both the potential and the challenges of cross-lingual transfer, with clear implications for deploying NLP tools in under-resourced languages and in domains such as biomedicine.

Abstract

Natural Language Processing (NLP) has seen remarkable advances in recent years, particularly with the emergence of Large Language Models that have achieved unprecedented performance across many tasks. However, these developments have mainly benefited a small number of high-resource languages such as English. The majority of languages still face significant challenges due to the scarcity of training data and computational resources. To address this issue, this thesis focuses on cross-lingual transfer learning, a research area aimed at leveraging data and models from high-resource languages to improve NLP performance for low-resource languages. Specifically, we focus on Sequence Labeling tasks such as Named Entity Recognition, Opinion Target Extraction, and Argument Mining. The research is structured around three main objectives: (1) advancing data-based cross-lingual transfer learning methods through improved translation and annotation projection techniques, (2) developing enhanced model-based transfer learning approaches utilizing state-of-the-art multilingual models, and (3) applying these methods to real-world problems while creating open-source resources that facilitate future research in low-resource NLP. More specifically, this thesis presents a new method to improve data-based transfer with T-Projection, a state-of-the-art annotation projection method that leverages text-to-text multilingual models and machine translation systems. T-Projection outperforms previous annotation projection methods by a wide margin. For model-based transfer, we introduce a constrained decoding algorithm that enhances cross-lingual Sequence Labeling in zero-shot settings using text-to-text models. Finally, we develop Medical mT5, the first multilingual text-to-text medical model, demonstrating the practical impact of our research on real-world applications.

Paper Structure

This paper contains 130 sections, 3 equations, 45 figures, 29 tables.

Figures (45)

  • Figure: Modern LLMs, which support text, image, and other multimodal representations, have achieved outstanding performance in a wide range of NLP tasks. They have been applied in many real-world applications.
  • Figure: Illustration of the Named Entity Recognition (NER) sequence labelling task. The goal is to identify and classify named entities in running text.
  • Figure: Illustration of multilingual embeddings, where two languages are mapped into a shared vector space. Words with similar meanings are placed close together.
  • Figure: Representation of the BERT architecture. During training, BERT learns to predict missing words in a sentence based on the contextual representations produced by the model.
  • Figure: Representation of the text-to-text framework in T5. Every task is framed as a text input and the model is trained to generate the desired output as text. Figure reproduced from Raffel et al. (2020).
  • ...and 40 more figures