Composable Sparse Fine-Tuning for Cross-Lingual Transfer

Alan Ansell, Edoardo Maria Ponti, Anna Korhonen, Ivan Vulić

TL;DR

This paper addresses zero-shot cross-lingual transfer by learning sparse, composable fine-tunings. It introduces Lottery Ticket Sparse Fine-Tuning (LT-SFT), a two-phase procedure inspired by the Lottery Ticket Hypothesis that produces sparse, real-valued language- and task-specific masks, which can be summed with the parameters of the pretrained model. In zero-shot transfer, LT-SFT consistently outperforms the adapter-based MAD-X baselines across 35 languages and four tasks, and the analysis indicates that sparsity reduces both interference between the composed fine-tunings and overfitting. The authors release code and models to support reproducibility and further applications of modular sparse fine-tuning.
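The two-phase idea can be illustrated with a toy sketch. It rests on assumptions that this summary does not spell out: phase 1 fine-tunes all parameters and keeps the `k` entries whose values changed most, and phase 2 rewinds to the pretrained values and trains only those entries. The quadratic toy loss and the names `gradient_step` and `lt_sft_sketch` are hypothetical and are not taken from the released code.

```python
def gradient_step(params, target, lr, mask=None):
    """One gradient-descent step on the toy loss 0.5 * sum((p - t) ** 2).

    If a mask is given, entries where mask[i] is False stay frozen.
    """
    updated = []
    for i, (p, t) in enumerate(zip(params, target)):
        grad = p - t
        if mask is not None and not mask[i]:
            grad = 0.0  # frozen parameter: no update
        updated.append(p - lr * grad)
    return updated


def lt_sft_sketch(pretrained, target, k, steps=50, lr=0.1):
    """Toy two-phase sparse fine-tuning (an assumed reading of the procedure)."""
    # Phase 1: fine-tune every parameter, starting from the pretrained values.
    full = list(pretrained)
    for _ in range(steps):
        full = gradient_step(full, target, lr)

    # Keep the k parameters with the largest absolute change.
    change = [abs(f - p) for f, p in zip(full, pretrained)]
    ranked = sorted(range(len(change)), key=lambda i: change[i], reverse=True)
    selected = set(ranked[:k])
    mask = [i in selected for i in range(len(pretrained))]

    # Phase 2: rewind to the pretrained values and train only the selected entries.
    sparse = list(pretrained)
    for _ in range(steps):
        sparse = gradient_step(sparse, target, lr, mask)

    # The sparse fine-tuning is the difference from the pretrained values.
    diff = [s - p for s, p in zip(sparse, pretrained)]
    return mask, diff


mask, diff = lt_sft_sketch([0.0, 0.0, 0.0, 0.0], [1.0, 0.1, -2.0, 0.05], k=2)
```

In this toy run only the two entries with the largest phase-1 change receive non-zero differences; the remaining entries keep their pretrained values exactly.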

Abstract

Fine-tuning the entire set of parameters of a large pretrained model has become the mainstream approach for transfer learning. To increase its efficiency and prevent catastrophic forgetting and interference, techniques like adapters and sparse fine-tuning have been developed. Adapters are modular, as they can be combined to adapt a model towards different facets of knowledge (e.g., dedicated language and/or task adapters). Sparse fine-tuning is expressive, as it controls the behavior of all model components. In this work, we introduce a new fine-tuning method with both these desirable properties. In particular, we learn sparse, real-valued masks based on a simple variant of the Lottery Ticket Hypothesis. Task-specific masks are obtained from annotated data in a source language, and language-specific masks from masked language modeling in a target language. Both these masks can then be composed with the pretrained model. Unlike adapter-based fine-tuning, this method neither increases the number of parameters at inference time nor alters the original model architecture. Most importantly, it outperforms adapters in zero-shot cross-lingual transfer by a large margin in a series of multilingual benchmarks, including Universal Dependencies, MasakhaNER, and AmericasNLI. Based on an in-depth analysis, we additionally find that sparsity is crucial to prevent both 1) interference between the fine-tunings to be composed and 2) overfitting. We release the code and models at https://github.com/cambridgeltl/composable-sft.
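The composition step described in the abstract can be sketched as follows, assuming that each sparse fine-tuning is stored as a set of per-parameter differences that are added to the pretrained values. The storage layout (`{tensor name: {index: delta}}`) and the function name `compose` are hypothetical choices for illustration, not the format of the released models.

```python
def compose(pretrained, *sparse_diffs):
    """Add sparse difference vectors to a copy of the pretrained parameters.

    pretrained:   {tensor name: list of floats}
    sparse_diffs: each {tensor name: {index: delta}}, holding only non-zero entries
    """
    adapted = {name: list(values) for name, values in pretrained.items()}
    for diff in sparse_diffs:
        for name, entries in diff.items():
            for index, delta in entries.items():
                adapted[name][index] += delta
    return adapted


pretrained = {"w": [1.0, 2.0, 3.0]}
task_diff = {"w": {0: 0.5}}    # hypothetical task-specific fine-tuning
lang_diff = {"w": {2: -1.0}}   # hypothetical language-specific fine-tuning

adapted = compose(pretrained, task_diff, lang_diff)
```

Because the differences are added in place of existing values, the adapted model has the same number of parameters and the same architecture as the pretrained one, which matches the contrast with adapters drawn in the abstract.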

Paper Structure

This paper contains 20 sections, 3 equations, 5 figures, 7 tables, 1 algorithm.

Figures (5)

  • Figure 1: A graphical representation of Lottery Ticket Sparse Fine-Tuning: from the parameters of a pretrained model (gray, left), we generate sparse fine-tunings for task and language knowledge (blue and red, center). Finally, we sum these three components (right) to obtain the adapted/fine-tuned model. Best viewed in color.
  • Figure 2: Zero-shot cross-lingual transfer evaluation of Lottery-Ticket Sparse Fine-Tuning (LT-SFT), Random Sparse Fine-Tuning (rand-SFT), and adapter-based MAD-X over four tasks with varying numbers of trainable parameters during task adaptation. Results are averages over all target languages.
  • Figure 3: Performance of LT-SFT on DP and NER controlling for the sparsity of task and language fine-tuning. Results are averaged over several selected languages. Denser fine-tunings may interfere with each other and consequently degrade the model performance.
  • Figure 4: Zero-shot cross-lingual transfer evaluation of Lottery-Ticket Sparse Fine-Tuning (LT-SFT) and MAD-X when pretrained language adapters from AdapterHub (Pfeiffer et al., 2020) are used during task training and evaluation. These adapters are trained for 250,000 steps with a batch size of 64, as opposed to the 100,000 steps of batch size 8 used in our experiments. LT-SFT nevertheless maintains an edge in performance across all tasks. Since AdapterHub adapters are only available for some of the languages in our evaluation, the results shown are averaged over only the languages for which they are available, indicated in the subfigure captions.
  • Figure 5: Percentage of parameters selected for the sparse fine-tuning of both languages in a pair.
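The quantity plotted in Figure 5 can be approximated with a short sketch. It assumes that each language fine-tuning is represented by the set of parameter indices it selects, that both sets have the same size, and that the percentage is taken relative to that size; the caption does not state the denominator, so this is one plausible reading. The function name `overlap_percentage` is hypothetical.

```python
def overlap_percentage(selected_a, selected_b):
    """Percentage of selected parameter indices shared by two languages.

    Assumes both selections have the same, non-zero size.
    """
    a, b = set(selected_a), set(selected_b)
    if not a or len(a) != len(b):
        raise ValueError("selections must be non-empty and of equal size")
    return 100.0 * len(a & b) / len(a)


# Two hypothetical selections of four parameter indices each.
shared = overlap_percentage({1, 2, 3, 4}, {3, 4, 5, 6})
```

A low percentage for a language pair would indicate that the two fine-tunings modify mostly different parameters.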