Overcoming Catastrophic Forgetting in Zero-Shot Cross-Lingual Generation

Tu Vu; Aditya Barua; Brian Lester; Daniel Cer; Mohit Iyyer; Noah Constant

TL;DR

This paper studies zero-shot cross-lingual generation for summarization when target-language labels are unavailable, introducing the WikiLingua-0 benchmark and the SP-Rouge metric for multilingual evaluation. It compares full-model fine-tuning and prompt tuning across multiple languages and model sizes, revealing catastrophic forgetting and showing that prompt tuning with larger models can improve non-English generation relative to full fine-tuning. The authors propose two mitigation strategies (mixing in unlabeled multilingual data, and factorized prompts) and show that they provide additional gains, especially under severe forgetting, though a gap to fully supervised baselines remains. Overall, the work provides a benchmark, empirical insights into cross-lingual transfer, and practical methods for more robust zero-shot multilingual generation across diverse languages and scripts.

Abstract

In this paper, we explore the challenging problem of performing a generative task in a target language when labeled data is only available in English, using summarization as a case study. We assume a strict setting with no access to parallel data or machine translation and find that common transfer learning approaches struggle in this setting, as a generative multilingual model fine-tuned purely on English catastrophically forgets how to generate non-English. Given the recent rise of parameter-efficient adaptation techniques, we conduct the first investigation into how one such method, prompt tuning (Lester et al., 2021), can overcome catastrophic forgetting to enable zero-shot cross-lingual generation. Our experiments show that parameter-efficient prompt tuning provides gains over standard fine-tuning when transferring between less-related languages, e.g., from English to Thai. However, a significant gap still remains between these methods and fully-supervised baselines. To improve cross-lingual transfer further, we explore several approaches, including: (1) mixing in unlabeled multilingual data, and (2) explicitly factoring prompts into recombinable language and task components. Our approaches can provide further quality gains, suggesting that robust zero-shot cross-lingual generation is within reach.
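The idea of factoring a prompt into recombinable language and task components can be illustrated with a minimal sketch. It is not the authors' implementation: the toy embedding width, the sub-prompt lengths, the language codes, and the helper names (`init_sub_prompt`, `compose_prompt`, `prepend_prompt`) are all assumptions made for illustration, and no training loop is shown. The sketch only demonstrates the composition step: a task sub-prompt learned alongside the English language sub-prompt is reused at inference time with a different language sub-prompt, while the underlying model stays frozen.

```python
import random

PROMPT_DIM = 4  # toy embedding width (assumption); real models use their own embedding size


def init_sub_prompt(length, dim=PROMPT_DIM, seed=0):
    """Create a sub-prompt: `length` trainable vectors, each of width `dim`."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(length)]


def compose_prompt(language_prompt, task_prompt):
    """Concatenate a language sub-prompt and a task sub-prompt along the sequence axis."""
    return language_prompt + task_prompt


def prepend_prompt(prompt, input_embeddings):
    """Prompt tuning feeds [prompt; input] to a frozen model; only the prompt is trained."""
    return prompt + input_embeddings


# Hypothetical sub-prompt tables; names and lengths are illustrative only.
language_prompts = {
    lang: init_sub_prompt(2, seed=i) for i, lang in enumerate(["en", "th"])
}
task_prompts = {"summarize": init_sub_prompt(3, seed=100)}

# Training-time combination: labeled summarization data exists only in English.
train_prompt = compose_prompt(language_prompts["en"], task_prompts["summarize"])

# Inference-time recombination: swap in the target-language sub-prompt.
thai_prompt = compose_prompt(language_prompts["th"], task_prompts["summarize"])

# Stand-in for five embedded input tokens.
toy_input = [[0.0] * PROMPT_DIM for _ in range(5)]
model_input = prepend_prompt(thai_prompt, toy_input)
```

In this sketch the task component is shared between `train_prompt` and `thai_prompt`, and only the language component differs, which is the property that makes the sub-prompts recombinable.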

Paper Structure (37 sections, 8 figures, 10 tables)

Introduction
Challenge of zero-shot cross-lingual generation
Problem formulation
Defining WikiLingua-0 zero-shot cross-lingual summarization:
Defining SP-RG for multilingual summarization evaluation:
Experimental setup
Baselines
Lead-64:
trans-train:
trans-test:
sup & sup-all:
Training and implementation details
Results and Discussion
WikiLingua-0 is challenging for both ModelTuning and PromptTuning:
PromptTuning is better on larger language shifts:
...and 22 more sections

Figures (8)

Figure 1: A demonstration of WikiLingua-0, a challenging zero-shot cross-lingual generation (XGen) task, which requires a model to learn a generative task from labeled data in one language (i.e., English), and then perform the equivalent task in another language at inference time.
Figure 2: (a) Zero-shot XGen summarization quality (SP-RG) and (b) target language accuracy (LID$_\mathrm{XX}$) of PromptTuning and ModelTuning models across five model sizes and four target languages: French (Fr), Vietnamese (Vi), Russian (Ru), and Thai (Th). English (En) performance is provided as a point of comparison, but is no longer a zero-shot task. (c) The effect of prompt length on PromptTuning performance at Base and XXL model sizes.
Figure 3: Learning curves showing how PromptTuning (top) and ModelTuning (bottom) progress in terms of summarization quality (left) and unwanted English output (right), at the XXL model size. Note, ModelTuning quality is lower overall, and predictions contain high (>40%) levels of unwanted ASCII.
Figure 4: SP-Rouge scores of our baselines (Lead-64, PromptTuning, ModelTuning) at the XXL model size, in the zero-shot XGen setting. For comparison, we also show the headroom available if a machine translation system is used (trans-train, trans-test), or if gold data in target languages is used (sup, sup-all).
Figure 5: Our "factorized prompts" approach learns recomposable language and task sub-prompts by training on all language / task combinations from a set of unsupervised tasks covering all target languages.
...and 3 more figures
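
The Figure 3 caption reports the share of unwanted ASCII in model predictions. The caption does not say how that share is computed, so the function below is only one plausible way to measure it, with `ascii_fraction` a hypothetical name. It counts the fraction of non-whitespace characters that fall in the ASCII range. Such a measure is only informative for target languages written in non-Latin scripts, such as Russian and Thai, since French and Vietnamese text is itself largely ASCII.

```python
def ascii_fraction(text):
    """Return the share of non-whitespace characters in `text` that are ASCII."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        # Treat empty or all-whitespace output as containing no ASCII (a convention chosen here).
        return 0.0
    return sum(ord(c) < 128 for c in chars) / len(chars)


def mean_ascii_fraction(predictions):
    """Average the per-prediction ASCII share over a list of output strings."""
    if not predictions:
        return 0.0
    return sum(ascii_fraction(p) for p in predictions) / len(predictions)
```

A prediction that drifts into English when Thai is expected would score close to 1.0, while a fully Thai output would score 0.0.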
