RTP-LX: Can LLMs Evaluate Toxicity in Multilingual Scenarios?

Adrian de Wynter, Ishaan Watts, Tua Wongsangaroonsri, Minghui Zhang, Noura Farra, Nektar Ege Altıntoprak, Lena Baur, Samantha Claudet, Pavel Gajdusek, Can Gören, Qilong Gu, Anna Kaminska, Tomasz Kaminski, Ruby Kuo, Akiko Kyuba, Jongho Lee, Kartik Mathur, Petter Merok, Ivana Milovanović, Nani Paananen, Vesa-Matti Paananen, Anna Pavlenko, Bruno Pereira Vidal, Luciano Strika, Yueh Tsao, Davide Turcato, Oleksandr Vakhno, Judit Velcsov, Anna Vickers, Stéphanie Visser, Herdyan Widarmanto, Andrey Zaikin, Si-Qing Chen

TL;DR

RTP-LX presents a human-annotated, culturally aware multilingual toxicity corpus spanning 28 languages to evaluate S/LLMs as toxicity detectors. The study shows that although several S/LLMs attain decent raw accuracy, their judgments diverge from human annotations, especially for context-dependent harms like microaggressions and bias, highlighting limitations of accuracy as a sole metric. A key contribution is the emphasis on participatory design and transcreation to capture local sensitivities, which improves dataset realism and fairness. The work demonstrates practical implications for deploying multilingual moderation tools and provides a resource for benchmarking and improving safe deployment of S/LLMs in diverse linguistic and cultural contexts.

Abstract

Large language models (LLMs) and small language models (SLMs) are being adopted at remarkable speed, although their safety remains a serious concern. With the advent of multilingual S/LLMs, the question now becomes a matter of scale: can we expand multilingual safety evaluations of these models with the same velocity at which they are deployed? To this end, we introduce RTP-LX, a human-transcreated and human-annotated corpus of toxic prompts and outputs in 28 languages. RTP-LX follows participatory design practices, and a portion of the corpus is specifically designed to detect culturally specific toxic language. We evaluate 10 S/LLMs on their ability to detect toxic content in a culturally sensitive, multilingual scenario. We find that, although they typically score acceptably in terms of accuracy, they have low agreement with human judges when holistically scoring the toxicity of a prompt, and have difficulty discerning harm in context-dependent scenarios, particularly with subtle-yet-harmful content (e.g. microaggressions, bias). We release this dataset to help further reduce harmful uses of these models and improve their safe deployment.

Paper Structure

This paper contains 33 sections, 12 figures, and 3 tables.

Figures (12)

  • Figure 1: Weighted Cohen's $\kappa$ correlations per harm category in RTP-LX prompts. S/LLMs could detect insults, violence, and sexual content. Subtler discourse, namely microaggressions, bias, and identity attacks, was not easily detectable by any of the models.
  • Figure 2: EM block rates when calculated using FLORES' Toxicity-200 block list for the transcreated/manual partition of RTP-LX. FLORES had an average $24.3\pm8.3\%$ block rate across all languages and partitions. The manual subset had a much lower ($-8\%$ on average) block rate than the transcreated subset. This suggests that the S/LLMs should, on average, assign a label denoting at least some toxicity to about $24\%$ of the corpus. Note that English does not have a manual corpus.
  • Figure 3: We labelled the prompt subset with the S/LLMs and compared their output with the majority vote of the human scores. In terms of raw accuracy (left), Llama Guard outperformed all other S/LLMs, closely followed by Gemma 7B and GPT-4 Turbo. ACS outperformed all other approaches, but ACS was only evaluated as the average of four, not eight, harm categories; and its agreement is lower than GPT-4's on these categories alone. When looking at mean $\kappa_w$ (right), it is clear that raw accuracy is not a sufficient measure because of RTP-LX's class imbalance: a lazy learner could always output the same label and still obtain decent performance. In fact, that is what happened for some models, such as Gemma 2B.
  • Figure 4: False positives (FPs) across all languages for the S/LLMs. Gemma 2B presented the highest FP rate, misidentifying up to 40% of the samples observed, while Llama Guard and ACS had near-zero FP rates.
  • Figure 5: Language availability versus $\kappa_w$ over all languages in the prompts subset. All S/LLMs decreased in $\kappa_w$ from high- to low-resource languages, with differences of up to around 10%.
  • ...and 7 more figures
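
The point made in the Figure 3 caption, that a lazy learner can score well on raw accuracy while showing no real agreement with human judges, can be illustrated with a small sketch. Everything here is an assumption for illustration: the three-point label scale, the toy labels, the linear weighting scheme, and the function names `accuracy` and `weighted_kappa` are not taken from the paper, which may compute $\kappa_w$ differently.

```python
from collections import Counter


def accuracy(human, model):
    """Fraction of items where the model label equals the human label."""
    return sum(h == m for h, m in zip(human, model)) / len(human)


def weighted_kappa(human, model, num_labels):
    """Linearly weighted Cohen's kappa for ordinal labels 0..num_labels-1.

    Assumption: linear disagreement weights |i - j| / (num_labels - 1).
    """
    n = len(human)
    joint = Counter(zip(human, model))  # observed (human, model) label pairs
    human_counts = Counter(human)       # marginal label counts, human rater
    model_counts = Counter(model)       # marginal label counts, model rater
    observed = 0.0
    expected = 0.0
    for i in range(num_labels):
        for j in range(num_labels):
            weight = abs(i - j) / (num_labels - 1)
            observed += weight * joint[(i, j)] / n
            expected += weight * human_counts[i] * model_counts[j] / (n * n)
    if expected == 0.0:
        # Both raters used one identical label throughout; kappa is
        # undefined, and this sketch returns 0.0 by convention.
        return 0.0
    return 1.0 - observed / expected


# Hypothetical, imbalanced toy data: most items are non-toxic (label 0).
human_labels = [0, 0, 0, 0, 0, 0, 0, 0, 2, 2]
lazy_labels = [0] * len(human_labels)  # a "lazy learner": always label 0

lazy_accuracy = accuracy(human_labels, lazy_labels)
lazy_kappa = weighted_kappa(human_labels, lazy_labels, num_labels=3)
print(f"accuracy={lazy_accuracy:.2f}, weighted kappa={lazy_kappa:.2f}")
```

On this toy data the constant-output model matches 8 of 10 human labels, yet its weighted $\kappa$ is zero, because the agreement it achieves is exactly what the label marginals predict by chance.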