KOTOX: A Korean Toxic Dataset for Deobfuscation and Detoxification

Yejin Lee, Su-Hyeon Kim, Hyundong Jin, Dayoung Kim, Yeonsoo Kim, Yo-Sub Han

TL;DR

This work addresses the challenge of toxic content in low-resource languages by introducing KOTOX, a Korean toxicity dataset designed for deobfuscation and detoxification. It unifies three tasks (Obfuscated Toxic Text Classification, Neutral Text Deobfuscation, and Obfuscated Toxic Text Sanitization) and defines 17 transformation rules across five obfuscation classes, producing 6,882 obfuscated neutral–toxic pairs across three difficulty levels. The dataset is built from a filtered K/DA base and evaluated with classification models and instruction-tuned LLMs, showing that obfuscation-aware training improves robustness while deobfuscation and sanitization remain challenging. The work advances Korean NLP safety and provides a framework for obfuscation-aware toxicity mitigation with potential extension to other languages.

Abstract

Toxic content has become an increasingly critical social issue with the rapid expansion of online communication. While numerous studies have explored methods for detecting and detoxifying such content, most have focused primarily on English, leaving low-resource languages underrepresented. Consequently, Large Language Models (LLMs) often struggle to identify and neutralize toxic expressions in these languages. This challenge becomes even more pronounced when users employ obfuscation techniques to evade detection systems. To address this issue, we propose KOTOX: Korean Toxic Dataset for deobfuscation and detoxification. We categorize various obfuscation approaches based on linguistic characteristics of Korean and define a set of transformation rules grounded in real-world examples. Using these rules, we construct three dataset versions (easy, normal, and hard) representing different levels of obfuscation difficulty. This is the first dataset that simultaneously supports deobfuscation and detoxification for the Korean language. We expect it to facilitate better understanding and mitigation of obfuscated toxic content in LLMs for low-resource languages. Our code and data are available at https://github.com/leeyejin1231/KOTOX.

Paper Structure

This paper contains 88 sections, 11 figures, 25 tables, 1 algorithm.

Figures (11)

  • Figure 1: Example of detecting obfuscated text.
  • Figure 2: Overview of KOTOX construction and targeting tasks.
  • Figure 3: Distribution of obfuscation rule frequencies in the total dataset.
  • Figure 4: Error ratio for each rule. HateBERT is trained and evaluated on the easy datasets. The error ratio indicates the proportion of misclassified samples among the data associated with each rule.
  • Figure 5: The prompt used for phonetic transliteration obfuscation with Latin scripts. It provides the task descriptions and instructions.
  • ...and 6 more figures
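
The error ratio described in the Figure 4 caption (the proportion of misclassified samples among the data associated with each rule) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name `error_ratio_per_rule`, the `(rules, gold, pred)` tuple layout, and the placeholder rule names are all assumptions made for the example.

```python
from collections import defaultdict


def error_ratio_per_rule(samples):
    """Return {rule: misclassified / total} over the samples tagged with each rule.

    `samples` is an iterable of (rules, gold_label, predicted_label) tuples,
    where `rules` lists the obfuscation rules applied to that sample.
    This layout is hypothetical and chosen only for illustration.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for rules, gold, pred in samples:
        # A sample counts once per distinct rule it is associated with.
        for rule in set(rules):
            totals[rule] += 1
            if gold != pred:
                errors[rule] += 1
    return {rule: errors[rule] / totals[rule] for rule in totals}


# Toy data with placeholder rule names (not rules from the dataset).
samples = [
    (["rule_a"], 1, 1),
    (["rule_a", "rule_b"], 1, 0),
    (["rule_b"], 0, 0),
    (["rule_b"], 1, 0),
]
ratios = error_ratio_per_rule(samples)
```

Because a sample may carry several rules, a single misclassification here contributes to the ratio of every rule attached to it; whether the paper attributes errors this way is not stated in the caption.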