Uncovering the Potential Risks in Unlearning: Danger of English-only Unlearning in Multilingual LLMs

Kyomin Hwang, Hyeonjin Kim, Seungyeon Kim, Sunghyun Wee, Nojun Kwak

TL;DR

This work shows that English-only unlearning in multilingual LLMs can trigger language confusion, causing traditional reference-based metrics to falsely indicate forgetting. It introduces the N-gram-based Language-Mix (N-Mix) score to quantify cross-language responses and demonstrates that multilingual LLMs exhibit substantial language confusion under English-only unlearning. The authors advocate semantic-based evaluation, validated with ChatGPT, to assess whether the target content has been forgotten across languages, and show that including multilingual data during unlearning mitigates but does not fully solve the issue. The study highlights the need for content-focused, language-robust evaluation methods and points to multilingual data strategies as a practical mitigation, while acknowledging current limitations and avenues for future work.

Abstract

A few studies have shown that attempting to erase multilingual knowledge using only English data is insufficient for multilingual LLMs. However, their analyses remain highly performance-oriented. In this paper, we shift the point of view to evaluation and address an additional blind spot that reveals itself when the multilingual LLM is fully fine-tuned on a parallel multilingual dataset before unlearning. Here, language confusion occurs, whereby a model responds in a language different from that of the input prompt. Language confusion is problematic for unlearning because it causes standard reference-based metrics to fail. We tackle this phenomenon in three steps: (1) we introduce the N-gram-based Language-Mix (N-Mix) score to show quantitatively that language confusion is pervasive and consistent in multilingual LLMs, (2) we demonstrate that reference-based metrics result in false negatives when the N-Mix score is high, and (3) we suggest the need for a new type of unlearning evaluation that can directly assess the content of the generated sentences. We call this type of metric a semantic-based metric.
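
To make the failure mode concrete, the following minimal sketch shows how a token-overlap metric can drop sharply when a model answers with the same content in a different language. ROUGE-L recall is used here only as a representative reference-based metric; this section does not state which metrics the paper evaluates. The name, the sentences, and the helper functions `lcs_length` and `rouge_l_recall` are hypothetical and chosen purely for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(reference, candidate):
    """ROUGE-L recall on whitespace tokens: LCS length / reference length."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens:
        return 0.0
    return lcs_length(ref_tokens, cand_tokens) / len(ref_tokens)


# Hypothetical English reference answer (assumption, not from the paper).
reference = "Alice Kim was born in Seoul"

# The model still knows the fact and answers in the prompt language.
same_language = "Alice Kim was born in Seoul"

# The model still knows the fact but answers in Spanish (language confusion).
confused_language = "Alice Kim nació en Seúl"

score_same = rouge_l_recall(reference, same_language)
score_confused = rouge_l_recall(reference, confused_language)

print(f"same language:     {score_same:.2f}")
print(f"confused language: {score_confused:.2f}")
```

Both answers carry the same fact, yet only the proper name overlaps with the reference in the second case, so the overlap score falls even though nothing has been forgotten. This is the gap that an evaluation of the generated content itself is meant to close.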

Paper Structure

This paper contains 46 sections, 9 equations, 9 figures, 35 tables.

Figures (9)

  • Figure 1: Unlearning results for a multilingual LLM using an English-only dataset. Language confusion renders conventional reference-based metrics inadequate for accurately measuring unlearning performance.
  • Figure 2: An overview of the dataset generation pipeline. (a) A single profile contains a name and seven classes of attributes. (b) The generated profiles are combined with QA templates to create the final multilingual QA dataset.
  • Figure 3: Examples of language confusion after applying English-only unlearning to Qwen2 with $\mathcal{D}^{\mathrm{en}}$.
  • Figure 4: Validation results of the N-Mix score. The vertical axis represents the base language, while the horizontal axis shows the input language. Each value indicates the corresponding N-Mix score, where higher scores (closer to 100) suggest that the input language differs from the base language.
  • Figure 5: Given a sentence $s_a$ written in the language on the $x$-axis (ground truth language), another sentence $s_b$ holding the same context but in the language on the $y$-axis (comparison language) is compared. Each cell reports the percentage of cases in which ChatGPT judged that the two sentences $s_a$ and $s_b$ are non-equivalent.
  • ...and 4 more figures
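
The Figure 4 caption describes a score on a 0 to 100 scale that grows as the input text departs from a base language. The paper's actual N-Mix definition is not given in this section, so the sketch below is not the authors' formula. It is a toy character n-gram score under stated assumptions: the function names `char_ngrams` and `toy_language_mix`, the tiny base corpus, and the choice of trigrams are all hypothetical.

```python
def char_ngrams(text, n=3):
    """Return the character n-grams of a lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def toy_language_mix(text, base_corpus, n=3):
    """Toy score in [0, 100]: share of n-grams in `text` unseen in the base corpus.

    This is an illustrative stand-in, NOT the paper's N-Mix score.
    """
    base_profile = set()
    for sentence in base_corpus:
        base_profile.update(char_ngrams(sentence, n))

    grams = char_ngrams(text, n)
    if not grams:
        return 0.0
    unseen = sum(1 for g in grams if g not in base_profile)
    return 100.0 * unseen / len(grams)


# Hypothetical English "base language" corpus (assumption, far too small for real use).
base_corpus = [
    "where was she born",
    "she was born in seoul",
    "what is her job",
]

score_base = toy_language_mix("she was born in seoul", base_corpus)
score_other = toy_language_mix("ella nació en seúl", base_corpus)

print(f"text in base language:  {score_base:.1f}")
print(f"text in other language: {score_other:.1f}")
```

With a realistic base-language profile, text in the base language would score near 0 and text in another language near 100, which matches the reading of the axes in Figure 4. A profile built from three sentences is only enough to show the direction of the effect.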