FUN with Fisher: Improving Generalization of Adapter-Based Cross-lingual Transfer with Scheduled Unfreezing

Chen Cecilia Liu, Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych

TL;DR

The paper tackles generalization in zero-shot cross-lingual transfer when large multilingual models are fine-tuned under parameter-efficiency constraints. It evaluates scheduled unfreezing methods, originally proposed to mitigate catastrophic forgetting, for training task adapters within the MAD-X framework, showing that they close the gap to full fine-tuning and can even outperform it in some cases. By analyzing learning dynamics with the trace of the Fisher Information, $\mathop{\mathrm{tr}}\nolimits(F)$, the work shows that scheduled unfreezing alters optimization trajectories in ways that correlate with cross-lingual performance. It also introduces FUN, a $\mathop{\mathrm{tr}}\nolimits(F)$-based automatic scheduler that matches heuristic methods across multiple datasets and adapter types. The findings offer a theory-informed lens on parameter-efficient transfer and provide practical scheduling strategies to boost cross-lingual generalization in adapters, with implications for broader adapter-based and LoRA-style setups.

Abstract

Standard fine-tuning of language models typically performs well on in-distribution data, but generalizes poorly under distribution shifts. In this work, we aim to improve the generalization of adapter-based cross-lingual task transfer where such cross-language distribution shifts are imminent. We investigate scheduled unfreezing algorithms, originally proposed to mitigate catastrophic forgetting in transfer learning, for fine-tuning task adapters. Our experiments show that scheduled unfreezing methods close the gap to full fine-tuning and achieve stronger cross-lingual transfer performance, suggesting that these methods can go beyond just mitigating catastrophic forgetting. Next, aiming to understand these empirical findings, we investigate the learning dynamics of scheduled unfreezing using Fisher Information. Our experiments reveal that scheduled unfreezing induces different learning dynamics compared to standard fine-tuning, and provide evidence that the dynamics of Fisher Information during training correlate with cross-lingual generalization performance. We additionally propose a general scheduled unfreezing algorithm that achieves an average improvement of 2 points over four datasets compared to standard fine-tuning and provides empirical evidence for a theory-based justification of the heuristic unfreezing schedule for adapter training.
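
To make the idea of a scheduled unfreezing algorithm concrete, the following is a minimal sketch of a gradual-unfreezing-style schedule. It is an illustration, not the authors' implementation: the function name, the top-down unfreezing order, and the fixed interval `steps_per_stage` are assumptions not stated in this summary.

```python
def gradual_unfreezing_schedule(num_adapters, steps_per_stage, total_steps):
    """Return, for every training step, the indices of trainable adapters.

    Assumptions (hypothetical, for illustration only):
    - adapters are indexed 0 (bottom layer) to num_adapters - 1 (top layer);
    - the top adapter is trainable from step 0;
    - one more adapter is unfrozen, top-down, every `steps_per_stage` steps.
    """
    schedule = []
    for step in range(total_steps):
        # Number of adapters that have been unfrozen by this step.
        num_unfrozen = min(num_adapters, step // steps_per_stage + 1)
        # The unfrozen adapters are the topmost `num_unfrozen` ones.
        trainable = list(range(num_adapters - num_unfrozen, num_adapters))
        schedule.append(trainable)
    return schedule


if __name__ == "__main__":
    for step, trainable in enumerate(
        gradual_unfreezing_schedule(num_adapters=4, steps_per_stage=2, total_steps=8)
    ):
        print(step, trainable)
```

In an actual training loop, the list for the current step would decide which adapters receive gradient updates, while the classifier stays trainable throughout.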

Paper Structure (27 sections, 1 equation, 6 figures, 18 tables, 2 algorithms)


Figures (6)

  • Figure 1: a) Standard fine-tuning, b) gradual unfreezing, and c) $\mathop{\mathrm{tr}}\nolimits(F)$-based scheduled unfreezing for training task adapters in adapter-based cross-lingual transfer. The classifier is not shown and is always trainable. All other components excluding task adapters, such as the original parameters of the base model and language adapters, are always frozen.
  • Figure 2: The relative performance of adapters fine-tuned with scheduled unfreezing (i.e., GU-based and LPFT-based task adapters) and standard fine-tuned task adapters, compared with full fine-tuning of mBERT and XLM-R.
  • Figure 3: Average $\mathop{\mathrm{tr}}\nolimits(F)$ per adapter during standard training versus using gradual unfreezing. Every point on the horizontal axis is 100 training steps for all datasets (except for XCOPA which is 50 steps).
  • Figure 4: Average $\mathop{\mathrm{tr}}\nolimits(F)$ per adapter (normalized to the range 0 to 1 so it can be plotted together with the validation curve) and validation F1 or accuracy using a randomly sampled schedule. The averages shown in the legend are the averaged cross-lingual transfer results: a) averaged F1 of MLQA and XQuAD, b) XNLI.
  • Figure 5: Averaged unfreezing schedules for GU and FUN with different base models.
  • ...and 1 more figures
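
Figures 3 and 4 track $\mathop{\mathrm{tr}}\nolimits(F)$ during training. As a rough illustration of the quantity itself, the standard definition is $\mathop{\mathrm{tr}}\nolimits(F) = \mathbb{E}_{x}\,\mathbb{E}_{\hat{y} \sim p_\theta(\cdot \mid x)} \lVert \nabla_\theta \log p_\theta(\hat{y} \mid x) \rVert^2$. The sketch below computes it for a toy linear softmax classifier. The model, the function names, and the exact expectation over labels are assumptions made for illustration; this summary does not state the estimator the paper uses, and a per-adapter version would restrict the gradient to one adapter's parameters.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def fisher_trace(W, X):
    """tr(F) of a toy classifier p(y | x) = softmax(W x), averaged over rows of X.

    The expectation over labels is taken under the model's own predictive
    distribution and computed exactly by summing over all classes.
    """
    probs = softmax(X @ W.T)  # shape: (num_examples, num_classes)
    total = 0.0
    for x, p in zip(X, probs):
        for y, p_y in enumerate(p):
            one_hot = np.zeros_like(p)
            one_hot[y] = 1.0
            # Gradient of log p(y | x) with respect to W.
            grad = np.outer(one_hot - p, x)
            total += p_y * np.sum(grad ** 2)
    return total / len(X)


if __name__ == "__main__":
    W = np.zeros((3, 2))                    # 3 classes, 2 features
    X = np.array([[1.0, 0.0], [0.0, 2.0]])  # two toy inputs
    print(fisher_trace(W, X))
```

For this toy model the inner sum has the closed form $(1 - \lVert p \rVert^2)\,\lVert x \rVert^2$ per example, which gives a quick sanity check on the loop.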