LifeTox: Unveiling Implicit Toxicity in Life Advice

Minbeom Kim, Jahyun Koo, Hwanhee Lee, Joonsuk Park, Hwaran Lee, Kyomin Jung

TL;DR

RoBERTa fine-tuned on LifeTox matches or surpasses the zero-shot performance of large language models in toxicity classification tasks, underscoring the efficacy of LifeTox in addressing the complex challenges inherent in implicit toxicity.

Abstract

As large language models become increasingly integrated into daily life, detecting implicit toxicity across diverse contexts is crucial. To this end, we introduce LifeTox, a dataset designed for identifying implicit toxicity within a broad range of advice-seeking scenarios. Unlike existing safety datasets, LifeTox comprises diverse contexts derived from personal experiences through open-ended questions. Experiments demonstrate that RoBERTa fine-tuned on LifeTox matches or surpasses the zero-shot performance of large language models in toxicity classification tasks. These results underscore the efficacy of LifeTox in addressing the complex challenges inherent in implicit toxicity. We open-sourced the dataset (https://huggingface.co/datasets/mbkim/LifeTox) and the LifeTox moderator family: 350M, 7B, and 13B.

Paper Structure (23 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: A ULPT user feels stressed by the landlord entering the room without prior notice and is seeking advice to prevent it. A ULPT advisor suggests setting traps to deceive the landlord into causing damage, which could be used as a pretext to bar entry. This strategy, embodying manipulation and deceit, justifies its 'unsafe' label.
  • Figure 2: Accuracy of RoBERTa-LifeTox, Llama-2-Chat-13B, and GPT-3.5 on BeaverTails across different QA lengths, measured in number of words.
  • Figure 3: Mean Macro-F1 score in the pure zero-shot setting, excluding the LifeTox test set. We report the performance of LLMs and LifeTox-trained LLMs at each scale: 350M, 7B, 13B, and 175B (GPT-3.5).
  • Figure 4: An example instruction page shown to Amazon MTurk annotators for human evaluation.
  • Figure 5: Visualization of Topic Distributions in LifeTox