Breaking mBad! Supervised Fine-tuning for Cross-Lingual Detoxification
Himanshu Beniwal, Youngwoo Kim, Maarten Sap, Soham Dan, Thomas Hartvigsen
TL;DR
The paper tackles toxicity in multilingual LLMs by proposing Cross-Lingual Detoxification ($X$-DET), which transfers detoxification from a source language to seven target languages using parallel toxic-neutral data. It evaluates four fine-tuning paradigms, Zero-Shot ($ZS$), Cross-lingual Fine-Tuning ($X$-$FT$), Percent-based Fine-Tuning ($P$-$FT$), and Multilingual Fine-Tuning ($M$-$FT$), across four models in 392 configurations on a multilingual detoxification dataset, measuring toxicity with the Perspective API and language modeling quality with perplexity. Detoxification transfers more strongly between languages that share a script and are better represented in pretraining data, and more weakly to languages with distant scripts (e.g., Chinese). Limited data can still yield effective detoxification, but often at the cost of degraded performance on non-toxic tasks and worse perplexity. The work contributes empirical evidence for cross-lingual safety mitigation in mLLMs, highlights script- and data-related factors that matter for safe deployment, and invites future work on more robust evaluation beyond the Perspective API.
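To make the two evaluation quantities concrete, here is a minimal sketch of how perplexity and an average toxicity reduction could be computed from per-token log-probabilities and per-sample toxicity scores. It is an illustration under stated assumptions, not the authors' evaluation code: the function names and the sample values are hypothetical, the toxicity scores are assumed to lie in [0, 1] (as Perspective API scores do), and no API call is made.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean natural-log token probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def toxicity_reduction(before, after):
    """Mean per-sample drop in toxicity score (positive means less toxic)."""
    if not before or len(before) != len(after):
        raise ValueError("score lists must be non-empty and the same length")
    return sum(b - a for b, a in zip(before, after)) / len(before)


# Hypothetical values, for illustration only.
logprobs = [math.log(0.5)] * 4   # every token assigned probability 0.5
scores_before = [0.8, 0.6]       # assumed toxicity before fine-tuning
scores_after = [0.2, 0.4]        # assumed toxicity after fine-tuning

ppl = perplexity(logprobs)
drop = toxicity_reduction(scores_before, scores_after)
```

Lower perplexity indicates that the model finds the text more predictable, so a detoxified model whose perplexity rises has paid for the toxicity drop with some loss of language modeling quality.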
Abstract
As large language models (LLMs) become increasingly prevalent in global applications, ensuring that they are toxicity-free across diverse linguistic contexts remains a critical challenge. We explore "Cross-lingual Detoxification", a cross-lingual paradigm that mitigates toxicity, enabling detoxification capabilities to transfer between high- and low-resource languages across different script families. We analyze the effectiveness of cross-lingual detoxification across 392 settings, evaluating toxicity reduction in cross-distribution settings with limited data, and investigate how mitigation affects model performance on non-toxic tasks, revealing trade-offs between safety and knowledge preservation. Our code and dataset are publicly available at https://github.com/himanshubeniwal/Breaking-mBad.
