MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer

Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, Sebastian Ruder

TL;DR

MAD-X is a modular, adapter-based framework that works around the capacity limits of pretrained multilingual models by learning three kinds of adapters: language adapters, task adapters, and invertible adapters. Language adapters are trained with masked language modelling (MLM) on unlabelled data while the base model stays frozen, and task adapters capture task-specific knowledge. Zero-shot transfer then amounts to swapping the source language adapter for the target language adapter, including for languages unseen during pretraining. Invertible adapters address the mismatch between the multilingual vocabulary and the target language at the embedding level. MAD-X outperforms strong baselines on named entity recognition (NER) and causal commonsense reasoning (CCR) and is competitive on question answering (QA), with the largest gains on low-resource and unseen languages, while remaining parameter-efficient and model-agnostic. Code and adapters are available on AdapterHub.ml.

Abstract

The main goal behind state-of-the-art pre-trained multilingual models such as multilingual BERT and XLM-R is enabling and bootstrapping NLP applications in low-resource languages through zero-shot or few-shot cross-lingual transfer. However, due to limited model capacity, their transfer performance is the weakest exactly on such low-resource languages and languages unseen during pre-training. We propose MAD-X, an adapter-based framework that enables high portability and parameter-efficient transfer to arbitrary tasks and languages by learning modular language and task representations. In addition, we introduce a novel invertible adapter architecture and a strong baseline method for adapting a pre-trained multilingual model to a new language. MAD-X outperforms the state of the art in cross-lingual transfer across a representative set of typologically diverse languages on named entity recognition and causal commonsense reasoning, and achieves competitive results on question answering. Our code and adapters are available at AdapterHub.ml.

Paper Structure

This paper contains 18 sections, 5 equations, 18 figures, 7 tables.

Figures (18)

  • Figure 1: The MAD-X framework inside a Transformer model. Input embeddings are fed into the invertible adapter whose inverse is fed into the tied output embeddings. Language and task adapters are added to each Transformer layer. Language adapters and invertible adapters are trained via masked language modelling (MLM) while the pretrained multilingual model is kept frozen. Task-specific adapters are stacked on top of source language adapters when training on a downstream task such as NER (full lines). During zero-shot cross-lingual transfer, source language adapters are replaced with target language adapters (dashed lines).
  • Figure 2: The invertible adapter (a) and its inverse (b). The input is split and transformed by projections $F$ and $G$, which are coupled in an alternating fashion. $|$ indicates the splitting of the input vector, and $[\,\,]$ indicates the concatenation of two vectors. $+$ and $-$ indicate element-wise addition and subtraction, respectively.
  • Figure 3: Relative $F_1$ improvement of MAD-X$^{Base}$ over XLM-R$^{Base}$ in cross-lingual NER transfer.
  • Figure 4: Cross-lingual NER performance of MAD-X transferring from English to the target languages with invertible and language adapters trained on target language data for different numbers of iterations. Shaded regions denote variance in $F_1$ scores across 5 runs.
  • Figure 5: Mean $F_1$ scores of XLM-R$^{Base}$ in the standard setting (XLM-R$^{Base}$) for cross-lingual transfer on NER.
  • ...and 13 more figures
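
The Figure 2 caption describes the invertible adapter as a split of the input vector, two projections $F$ and $G$ applied in an alternating fashion, and element-wise addition (forward) or subtraction (inverse). The following is a minimal sketch of that idea, assuming an additive coupling of the form $o_1 = F(e_2) + e_1$, $o_2 = G(o_1) + e_2$; the exact coupling order is an assumption read off the caption, not confirmed by the text here. The helper `make_projection`, the bottleneck size, and the random weights are hypothetical stand-ins for the learned projections.

```python
import random


def make_projection(dim, hidden, rng):
    """Hypothetical stand-in for F or G: a down/up projection with a ReLU.

    The weights are random here; in the framework they would be learned.
    """
    w_down = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(hidden)]
    w_up = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(dim)]

    def project(x):
        h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_down]
        return [sum(w * hi for w, hi in zip(row, h)) for row in w_up]

    return project


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def invertible_adapter(e, F, G):
    """Forward pass (assumed form): split, couple additively, concatenate."""
    half = len(e) // 2            # assumes an even embedding dimension
    e1, e2 = e[:half], e[half:]   # the "|" split in the caption
    o1 = add(F(e2), e1)
    o2 = add(G(o1), e2)
    return o1 + o2                # list concatenation, the "[ ]" in the caption


def invertible_adapter_inverse(o, F, G):
    """Inverse pass: undo the two coupling steps in reverse order."""
    half = len(o) // 2
    o1, o2 = o[:half], o[half:]
    e2 = sub(o2, G(o1))
    e1 = sub(o1, F(e2))
    return e1 + e2


rng = random.Random(0)
dim = 8
F = make_projection(dim // 2, 2, rng)
G = make_projection(dim // 2, 2, rng)

e = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
o = invertible_adapter(e, F, G)
e_rec = invertible_adapter_inverse(o, F, G)

# Largest element-wise reconstruction error; should be at floating-point noise level.
max_err = max(abs(a - b) for a, b in zip(e, e_rec))
```

Because each step only adds a function of the half that is left unchanged, the inverse never needs to invert $F$ or $G$ themselves. That is what would allow the same parameters to be used on the input embeddings and, inverted, before the tied output embeddings, as Figure 1 describes.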