Preference Tuning For Toxicity Mitigation Generalizes Across Languages

Xiaochen Li; Zheng-Xin Yong; Stephen H. Bach

Preference Tuning For Toxicity Mitigation Generalizes Across Languages

Xiaochen Li, Zheng-Xin Yong, Stephen H. Bach

TL;DR

The paper addresses multilingual toxicity in LLMs and the challenge of detoxifying non-English outputs without language-specific data. It demonstrates zero-shot cross-lingual detoxification by training with English-only Direct Preference Optimization (DPO) data, achieving substantial reductions in toxicity across 17 languages and multiple models. A mechanistic analysis reveals dual multilinguality of MLP substructures—toxic value vectors are multilingual and toxic key vectors respond across languages, with DPO suppressing their activations. The work further shows that bilingual sentence retrieval accuracy strongly predicts cross-lingual transferability, offering a practical predictor for generalization. These findings suggest scalable safety improvements for multilingual LLMs, while noting limits for very low-resource languages.

Abstract

Detoxifying multilingual Large Language Models (LLMs) has become crucial due to their increasing global use. In this work, we explore zero-shot cross-lingual generalization of preference tuning in detoxifying LLMs. Unlike previous studies that show limited cross-lingual generalization for other safety tasks, we demonstrate that Direct Preference Optimization (DPO) training with only English data can significantly reduce toxicity in multilingual open-ended generations. For example, the probability of mGPT-1.3B generating toxic continuations drops from 46.8% to 3.9% across 17 different languages after training. Our results also extend to other multilingual LLMs, such as BLOOM, Llama3, and Aya-23. Using mechanistic interpretability tools like causal intervention and activation analysis, we identified the dual multilinguality property of MLP layers in LLMs, which explains the cross-lingual generalization of DPO. Finally, we show that bilingual sentence retrieval can predict the cross-lingual transferability of DPO preference tuning.

Preference Tuning For Toxicity Mitigation Generalizes Across Languages

TL;DR

Abstract

Paper Structure (35 sections, 3 equations, 15 figures, 9 tables)

This paper contains 35 sections, 3 equations, 15 figures, 9 tables.

Introduction
Related Work
Cross-lingual generalization of RLHF/RLAIF
Multilingual toxicity evaluation and mitigation
Safety-specific regions in LLMs
Cross-lingual Toxicity Mitigation
Multilingual Toxicity Evaluation
Evaluation dataset
Metrics
Toxicity
Fluency
Diversity
Results
Mechanism
Preliminaries
...and 20 more sections

Figures (15)

Figure 1: Safety preference tuning on English (en) pairwise toxic/non-toxic data reduces mGPT's shliazhko2024mgpt probability in generating toxic continuations (\ref{['fig:mgpt_dpo_result_toxprob']}) and the expected toxicity level in its most-toxic generations (\ref{['fig:mgpt_dpo_result_emt']}) across 17 different languages. We report results averaged over 5 seeds DPO training rafailov2023dpo.
Figure 2: Intervention with negative offsets on all 36 neurons' activations from the actual sources of toxicity reduces average toxicity level across 17 different languages. Experiments are done with greedy decoding.
Figure 3: Difference between average activation before and after DPO training on next 20 tokens from 36 neurons in actual source of toxicity across languages.
Figure 4: Strong positive correlation (Pearson-r = 0.732, p < 0.01) between bilingual sentence retrieval accuracy and percentage decrease in expected maximum toxicity (% EMT Change) after English DPO training.
Figure 5: Toxicity reduction of BLOOM-1.7B workshop2022bloom after DPO training.
...and 10 more figures

Preference Tuning For Toxicity Mitigation Generalizes Across Languages

TL;DR

Abstract

Preference Tuning For Toxicity Mitigation Generalizes Across Languages

Authors

TL;DR

Abstract

Table of Contents

Figures (15)