Realistic Evaluation of Toxicity in Large Language Models

Tinh Son Luong; Thanh-Thien Le; Linh Ngo Van; Thien Huu Nguyen

TL;DR

This paper introduces the Thoroughly Engineered Toxicity (TET) dataset to enable realistic toxicity evaluation of large language models under real-world and jailbreak-like prompts. By filtering over 1 million real-world interactions down to 2,546 prompts and evaluating seven prominent LLMs with Perspective API across six toxicity types, the authors demonstrate that TET reveals higher toxicity than prior benchmarks such as ToxiGen. They further compare TET to ToxiGen using a distribution-matching variant (ToxiGen-S), showing TET’s superior effectiveness at eliciting toxic outputs in most models. The work also investigates jailbreak prompts, defense strategies (toxicity classifiers, defensive prompts, training), and model-specific vulnerabilities, highlighting the complexity of safeguarding LLMs in practice. Overall, TET serves as a rigorous, evolving benchmark for AI safety and responsible deployment, with future work planned to expand model coverage and conversation-context evaluation.

Abstract

Large language models (LLMs) have become integral to our professional workflows and daily lives. Nevertheless, these machine companions of ours have a critical flaw: the huge amount of data that endows them with vast and diverse knowledge also exposes them to inevitable toxicity and bias. While most LLMs incorporate defense mechanisms to prevent the generation of harmful content, these safeguards can be easily bypassed with minimal prompt engineering. In this paper, we introduce the new Thoroughly Engineered Toxicity (TET) dataset, comprising manually crafted prompts designed to nullify the protective layers of such models. Through extensive evaluations, we demonstrate the pivotal role of TET in providing a rigorous benchmark for evaluating toxicity awareness in several popular LLMs: it highlights toxicity in the LLMs that might remain hidden when using normal prompts, thus revealing subtler issues in their behavior.

Paper Structure (14 sections, 4 figures, 4 tables)

Introduction
Dataset Construction
Evaluation Settings
Toxicity Evaluation of LLMs
TET versus ToxiGen
Effects of Jailbreaking on Different Models
Conclusions
Appendix
HateBERT and Perspective API
Creation of ToxiGen-S
Additional Jailbreaking Results
Defense Against Toxicity
Some Observations regarding Llama Guard
Example prompts

Figures (4)

Figure 1: Illustration of the general-toxicity score distributions of TET (orange) and ToxiGen-S (blue).
Figure 2: Example of a prompt in TET dataset.
Figure 3: Example of a prompt created using the ToxiGen dataset.
Figure 4: Five of the jailbreak templates in the TET dataset.
