Chinese Toxic Language Mitigation via Sentiment Polarity Consistent Rewrites

Xintong Wang, Yixiao Liu, Jingheng Pan, Liang Ding, Longyue Wang, Chris Biemann

TL;DR

This work introduces ToxiRewriteCN, the first Chinese detoxification dataset designed to preserve sentiment polarity, capturing 1,556 toxic-to-non-toxic rewrite triplets across standard, emoji/homophone, and conversational contexts. A six-step human-in-the-loop pipeline combines model-based drafts (via Qwen-Max) with thorough human correction and cross-verification to ensure polarity-consistent rewrites and fine-grained toxic spans. The authors benchmark 17 LLMs spanning closed-source, open-source dense, and MoE architectures, revealing that while larger models excel at detoxification, maintaining the original emotional tone remains challenging, especially in emoji/homophone and multi-turn dialogue scenarios. The study provides nuanced, scenario-specific insights and releases ToxiRewriteCN to foster sentiment-aware detoxification research in Chinese, with implications for safer and more expressive moderation in multilingual settings.

Abstract

Detoxifying offensive language while preserving the speaker's original intent is a challenging yet critical goal for improving the quality of online interactions. Although large language models (LLMs) show promise in rewriting toxic content, they often default to overly polite rewrites, distorting the emotional tone and communicative intent. This problem is especially acute in Chinese, where toxicity often arises implicitly through emojis, homophones, or discourse context. We present ToxiRewriteCN, the first Chinese detoxification dataset explicitly designed to preserve sentiment polarity. The dataset comprises 1,556 carefully annotated triplets, each containing a toxic sentence, a sentiment-aligned non-toxic rewrite, and labeled toxic spans. It covers five real-world scenarios: standard expressions, emoji-induced and homophonic toxicity, as well as single-turn and multi-turn dialogues. We evaluate 17 LLMs, including commercial and open-source models with variant architectures, across four dimensions: detoxification accuracy, fluency, content preservation, and sentiment polarity. Results show that while commercial and MoE models perform best overall, all models struggle to balance safety with emotional fidelity in more subtle or context-heavy settings such as emoji, homophone, and dialogue-based inputs. We release ToxiRewriteCN to support future research on controllable, sentiment-aware detoxification for Chinese.
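The triplet structure described in the abstract (toxic sentence, sentiment-aligned rewrite, labeled toxic spans) can be sketched as a small record type. This is a minimal illustration under stated assumptions, not the released dataset schema: the class name `RewriteTriplet`, the field names, the scenario identifiers, and the helper `leaked_spans` are all hypothetical, and the example text is an invented placeholder rather than a dataset entry.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical identifiers for the five scenarios named in the abstract.
SCENARIOS = {"standard", "emoji", "homophone", "single_turn", "multi_turn"}


@dataclass
class RewriteTriplet:
    """One annotated item: toxic input, polarity-preserving rewrite, toxic spans."""
    toxic: str               # original toxic sentence
    rewrite: str             # non-toxic rewrite with the same sentiment polarity
    toxic_spans: List[str]   # substrings of `toxic` labeled as toxic
    scenario: str = "standard"

    def __post_init__(self) -> None:
        if self.scenario not in SCENARIOS:
            raise ValueError(f"unknown scenario: {self.scenario}")
        # Assumption: every labeled span occurs verbatim in the toxic sentence.
        missing = [s for s in self.toxic_spans if s not in self.toxic]
        if missing:
            raise ValueError(f"spans not found in toxic sentence: {missing}")


def leaked_spans(triplet: RewriteTriplet, candidate: str) -> List[str]:
    """Return the labeled toxic spans that still appear verbatim in `candidate`."""
    return [s for s in triplet.toxic_spans if s in candidate]


# Invented placeholder example (not taken from ToxiRewriteCN).
example = RewriteTriplet(
    toxic="This plan is [SLUR] and useless",
    rewrite="This plan is badly thought out and useless",
    toxic_spans=["[SLUR]"],
)
```

A substring check such as `leaked_spans(example, model_output)` is only a rough proxy for whether a rewrite still contains labeled toxic words; it is not a description of how the paper computes its detoxification metrics.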

Paper Structure

This paper contains 29 sections, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Illustration of three outcomes in detoxifying toxic Chinese sentences: (1) blocked by rule-based filters, (2) overly polite rewrites that distort user intent, and (3) sentiment-aligned detoxification that preserves emotional tone while removing toxicity.
  • Figure 2: Overview of the human-in-the-loop annotation pipeline. The process consists of three stages: (1) Data Filtering, where candidate toxic samples are selected and verified for rewrite suitability; (2) Rewrite with Sentiment Polarity, where LLMs perform coarse rewriting followed by human correction; and (3) Cross-verification, where annotations are validated. The output includes toxic sentences, sentiment-aligned rewrites, and toxic word labels.
  • Figure 3: Human post-correction interface. Annotators are shown the toxic sentence and the coarse rewrite. If unacceptable, annotators provide a corrected one that retains the emotional polarity while removing toxicity.
  • Figure 4: Comparison of four model variants (Generation, Reasoning, MoE, and Dense) across different evaluation scenarios: overall, single-sentence, emoji, homophone, single-turn conversation, and multi-turn conversation. Each chart visualizes performance on six metrics: Detox-CLS, Detox-Clean, Fluency, Content Preservation, Neutral Polarity, and Polite Polarity.
  • Figure 5: Distribution of toxic data in the ToxiRewriteCN dataset. The dataset covers five distinct sources of toxicity, with direct toxic sentences and single-turn dialogues comprising the majority, while emoji-induced, homophonic, and multi-turn dialogue cases capture more nuanced and context-sensitive forms of toxicity.
  • ...and 2 more figures