Something Just Like TRuST : Toxicity Recognition of Span and Target

Berk Atil, Namrata Sureddy, Rebecca J. Passonneau

TL;DR

TRuST introduces a unified, large-scale toxicity benchmark by merging prior resources into ~298k labeled items with binary toxicity, 8 high-level target groups, 24 subgroups, and token-level toxic spans. An 11k-example human-annotated subset provides reliable inter-annotator consistency, while automated labels enable scalable expansion. Benchmarking shows fine-tuned PLMs outperform LLMs across toxicity, targets, and spans, with automated prompting and reasoning offering limited gains, highlighting the need for improved social reasoning in models. TRuST thus offers a robust, comprehensive resource for evaluating and mitigating toxicity in socially-aware language technologies and supports generalization to external moderation datasets.

Abstract

Toxic language includes content that is offensive, abusive, or that promotes harm. Progress in preventing toxic output from large language models (LLMs) is hampered by inconsistent definitions of toxicity. We introduce TRuST, a large-scale dataset that unifies and expands prior resources through a carefully synthesized definition of toxicity and a corresponding annotation scheme. It consists of ~300k annotations, with high-quality human annotation on ~11k. To ensure high quality, we designed a rigorous, multi-stage human annotation process and evaluated the diversity of the annotators. We then benchmarked state-of-the-art LLMs and pre-trained models on three tasks: toxicity detection, identification of the target group, and identification of toxic words. Our results indicate that fine-tuned PLMs outperform LLMs on the three tasks, and that current reasoning models do not reliably improve performance. TRuST constitutes one of the most comprehensive resources for evaluating and mitigating LLM toxicity, and for other research in socially-aware and safer language technologies.
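
To make the three tasks concrete, the following is a minimal Python sketch of how one labeled item and a token-level span score might be represented. The class name, field names, and the `span_f1` helper are hypothetical illustrations, not the dataset's actual schema or the authors' evaluation code. The F1 shown is a common token-overlap formulation and is assumed here rather than confirmed by the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class ToxicityExample:
    """Hypothetical record covering the three tasks (names are assumptions)."""
    tokens: List[str]                    # tokenized input text
    toxic: bool                          # task 1: binary toxicity label
    target_group: Optional[str] = None   # task 2: targeted group, if any
    toxic_span: Set[int] = field(default_factory=set)  # task 3: toxic token indices


def span_f1(gold: Set[int], pred: Set[int]) -> float:
    """Token-overlap F1 between gold and predicted toxic token indices."""
    if not gold and not pred:
        # Both spans empty: treat as perfect agreement (an assumed convention).
        return 1.0
    overlap = len(gold & pred)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Placeholder item: tokens at positions 1 and 2 are marked as the toxic span.
example = ToxicityExample(
    tokens=["w0", "w1", "w2", "w3"],
    toxic=True,
    target_group="placeholder_group",
    toxic_span={1, 2},
)
score = span_f1(example.toxic_span, {2, 3})
```

With the placeholder item above, a prediction of `{2, 3}` shares one token with the gold span `{1, 2}`, so precision and recall are both 0.5 and the score is 0.5.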

Paper Structure

This paper contains 48 sections, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Histogram of distinct tokens within toxic spans (x-axis) ordered by total count (y-axis). The vertical "tear" at around 150 on the x-axis shows there is a long tail. The 25 most common words are shown descending on a diagonal.
  • Figure 2: Few-Shot Comparison for Zero to 6-Shot across Four Tasks with Four Models.
  • Figure 3: F1 scores of the models on the span data labeled as "all sentence" (x-axis) vs others (specific spans are found, y-axis). Reasoning models, few-shot prompted models, and zero-shot models are labeled with a different color.
  • Figure 4: Accuracy for each higher-level target group
  • Figure 5: Target group prediction accuracy for each target group
  • ...and 1 more figure