Composable Sparse Fine-Tuning for Cross-Lingual Transfer

Alan Ansell; Edoardo Maria Ponti; Anna Korhonen; Ivan Vulić

TL;DR

This paper addresses efficient cross-lingual transfer by learning sparse, composable fine-tunings. It introduces LT-SFT, a two-phase procedure inspired by the Lottery Ticket Hypothesis that produces sparse, real-valued language-specific and task-specific masks, which can be added to the pretrained model's parameters. In zero-shot transfer, LT-SFT consistently outperforms MAD-X baselines across 35 languages and four tasks, and the analysis shows that sparsity reduces both interference between the composed fine-tunings and overfitting. The authors release code and models to support reproducibility and further applications of modular sparse fine-tuning.

Abstract

Fine-tuning the entire set of parameters of a large pretrained model has become the mainstream approach for transfer learning. To increase its efficiency and prevent catastrophic forgetting and interference, techniques like adapters and sparse fine-tuning have been developed. Adapters are modular, as they can be combined to adapt a model towards different facets of knowledge (e.g., dedicated language and/or task adapters). Sparse fine-tuning is expressive, as it controls the behavior of all model components. In this work, we introduce a new fine-tuning method with both these desirable properties. In particular, we learn sparse, real-valued masks based on a simple variant of the Lottery Ticket Hypothesis. Task-specific masks are obtained from annotated data in a source language, and language-specific masks from masked language modeling in a target language. Both these masks can then be composed with the pretrained model. Unlike adapter-based fine-tuning, this method neither increases the number of parameters at inference time nor alters the original model architecture. Most importantly, it outperforms adapters in zero-shot cross-lingual transfer by a large margin in a series of multilingual benchmarks, including Universal Dependencies, MasakhaNER, and AmericasNLI. Based on an in-depth analysis, we additionally find that sparsity is crucial to prevent both 1) interference between the fine-tunings to be composed and 2) overfitting. We release the code and models at https://github.com/cambridgeltl/composable-sft.
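The composition idea described above can be illustrated with a minimal NumPy sketch on a toy parameter vector. Several details here are assumptions made for illustration and are not stated in the abstract: selecting the entries that change most during a first full fine-tuning pass (one common reading of a Lottery Ticket-style procedure), the quadratic toy losses, and all function names (`top_k_mask`, `fine_tune`, `sparse_fine_tuning`, `compose`). It is not the released implementation.

```python
import numpy as np

def top_k_mask(delta, k):
    """Boolean mask marking the k entries of a 1-D array with largest magnitude."""
    mask = np.zeros(delta.shape, dtype=bool)
    if k > 0:
        idx = np.argpartition(np.abs(delta), -k)[-k:]
        mask[idx] = True
    return mask

def fine_tune(theta_init, grad_fn, mask=None, steps=200, lr=0.1):
    """Plain gradient descent; if a mask is given, only masked entries move."""
    theta = theta_init.copy()
    for _ in range(steps):
        grad = grad_fn(theta)
        if mask is not None:
            grad = np.where(mask, grad, 0.0)  # freeze unselected parameters
        theta -= lr * grad
    return theta

def sparse_fine_tuning(theta0, grad_fn, k):
    """Two-phase sketch (assumed): select parameters, rewind, retrain sparsely."""
    # Phase 1: fine-tune everything, then keep the k most-changed entries.
    full = fine_tune(theta0, grad_fn)
    mask = top_k_mask(full - theta0, k)
    # Phase 2: restart from the pretrained values and train only those entries.
    sparse = fine_tune(theta0, grad_fn, mask=mask)
    return sparse - theta0  # sparse, real-valued difference vector

def compose(theta0, *diffs):
    """Add any number of sparse difference vectors to the pretrained parameters."""
    return theta0 + sum(diffs)

# Toy setup: quadratic losses 0.5 * ||theta - target||^2 stand in for the
# task objective and the masked language modeling objective (hypothetical).
theta0 = np.zeros(8)
task_target = np.array([3.0, 0.1, 0.0, 0.0, -2.0, 0.0, 0.05, 0.0])
lang_target = np.array([0.0, 0.0, 1.5, 0.0, 0.0, -1.0, 0.0, 0.2])

phi_task = sparse_fine_tuning(theta0, lambda th: th - task_target, k=2)
phi_lang = sparse_fine_tuning(theta0, lambda th: th - lang_target, k=2)

# Composition leaves the parameter count and the architecture unchanged.
theta = compose(theta0, phi_task, phi_lang)
```

In this sketch the composed vector `theta` has the same shape as `theta0`, which mirrors the abstract's point that composition neither adds parameters at inference time nor changes the architecture. Because each difference vector touches only a few entries, the two can be added with little overlap, which is the intuition behind sparsity limiting interference.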
