Breaking mBad! Supervised Fine-tuning for Cross-Lingual Detoxification

Himanshu Beniwal, Youngwoo Kim, Maarten Sap, Soham Dan, Thomas Hartvigsen

TL;DR

The paper tackles toxicity in multilingual LLMs by proposing Cross-Lingual Detoxification ($X$-DET), which transfers detoxification from a source language to seven target languages using parallel toxic-neutral data. It evaluates four paradigms, Zero-Shot ($ZS$), Cross-lingual Fine-Tuning ($X$-$FT$), Percent-based Fine-Tuning ($P$-$FT$), and Multilingual Fine-Tuning ($M$-$FT$), across four models in 392 configurations on a multilingual detoxification dataset, scoring toxicity with the Perspective API and tracking perplexity. The key findings are that detoxification transfers more strongly to languages with similar scripts and higher pretraining representation, while distant scripts (e.g., Chinese) show weaker transfer; limited data can still yield effective detoxification, but often at the cost of performance on non-toxic tasks and worse perplexity. The work contributes empirical evidence for cross-lingual safety mitigation in mLLMs, highlights the script- and data-related factors that matter for safe deployment, and invites future work on evaluation that is more robust than the Perspective API alone.

Abstract

As large language models (LLMs) become increasingly prevalent in global applications, ensuring that they are toxicity-free across diverse linguistic contexts remains a critical challenge. We explore "Cross-lingual Detoxification", a cross-lingual paradigm that mitigates toxicity, enabling detoxification capabilities to transfer between high and low-resource languages across different script families. We analyze cross-lingual detoxification's effectiveness through 392 extensive settings to evaluate toxicity reduction in cross-distribution settings with limited data and investigate how mitigation impacts model performance on non-toxic tasks, revealing trade-offs between safety and knowledge preservation. Our code and dataset are publicly available at https://github.com/himanshubeniwal/Breaking-mBad.

Paper Structure

This paper contains 14 sections, 32 figures, 25 tables.

Figures (32)

  • Figure 1: An overview of Cross-lingual Detoxification. (Top) An example where the model generates a toxic sentence; (Bottom) detoxification in German yields neutral generations. Takeaway: Detoxification works effectively in a cross-lingual setting.
  • Figure 2: Toxicity scores for Zero-Shot ($ZS$), Percent-based Fine-Tuning ($P$-$FT$) (10%, 20%, and 30%), and Multilingual Fine-Tuning ($M$-$FT$, or 100%) for aya-23-8B over the toxic-train, toxic-test, and neutral-test evaluation sets. Takeaway: Indo-European languages tend to show higher toxicity mitigation than non-Indo-European languages.
  • Figure 3: Average $\Delta$-Toxicity scores for $P$-$FT$ vs. $M$-$FT$ for aya-23-8B over the toxic-train all-languages evaluation set. Takeaway: "ar" showed a similar trend to "es" and "en".
  • Figure 4: Toxicity scores for $ZS$, $X$-$FT$, $P$-$FT$, and $M$-$FT$ for aya-expanse-8B over all three evaluation sets. Takeaway: Languages from similar script families show similar behavior.
  • Figure 5: Toxicity scores for $ZS$, $P$-$FT$, and $M$-$FT$ for mt5-large over all three evaluation sets. Takeaway: All languages show significantly low detoxification scores.
  • ...and 27 more figures
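
The captions above refer to percent-based subsets (10%, 20%, 30%, with 100% corresponding to $M$-$FT$) and to an average $\Delta$-Toxicity. The following is a minimal sketch of how such subsets and a toxicity difference could be computed. It is not the authors' code: the helper names `sample_fraction` and `delta_toxicity`, the placeholder sentence pairs, and the sign convention (zero-shot mean minus fine-tuned mean) are all assumptions made for illustration, and the toxicity scores are taken as already-computed floats rather than fetched from any API.

```python
import random
from statistics import mean


def sample_fraction(pairs, fraction, seed=0):
    """Return a random subset holding `fraction` of the parallel pairs.

    Hypothetical helper: the paper's actual sampling procedure may differ.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(len(pairs) * fraction))
    rng = random.Random(seed)  # fixed seed so subsets are reproducible
    return rng.sample(pairs, k)


def delta_toxicity(zero_shot_scores, finetuned_scores):
    """Mean zero-shot toxicity minus mean fine-tuned toxicity.

    Assumed sign convention: a positive value means toxicity went down.
    """
    return mean(zero_shot_scores) - mean(finetuned_scores)


# Placeholder parallel data: (toxic, neutral) sentence pairs.
pairs = [(f"toxic sentence {i}", f"neutral sentence {i}") for i in range(100)]

# One training subset per percentage; 100% uses every pair.
subsets = {pct: sample_fraction(pairs, pct / 100) for pct in (10, 20, 30, 100)}

# Made-up toxicity scores in [0, 1], purely to exercise the function.
example_delta = delta_toxicity([0.8, 0.6], [0.2, 0.4])
```

Fixing the seed keeps the subsets comparable across runs, which matters when toxicity after training on 10% of the data is compared against training on 30%.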