Tox-BART: Leveraging Toxicity Attributes for Explanation Generation of Implicit Hate Speech

Neemesh Yadav; Sarah Masud; Vikram Goyal; Md Shad Akhtar; Tanmoy Chakraborty

TL;DR

This work introduces Tox-BART, a toxicity-attribute–driven explanation generator for implicit hate speech that relies on in-dataset and in-domain toxicity signals rather than traditional knowledge graphs. Through extensive comparisons with KG-infused baselines and zero-shot GPT-3.5, the authors show that toxicity signals can achieve comparable or superior explanatory quality, with human evaluators often preferring Tox-BART for specificity and relevance. Ablation studies reveal nuanced effects of signal configuration, with in-dataset attributes providing robust gains, while the quality of KG tuples yields inconsistent improvements. The study underscores the importance of domain-specific signals and human-in-the-loop curation for subjective tasks like implicit hate explanation, and highlights implications for moderation pipelines and future research on domain-aware representations and efficient external-signal integration.

Abstract

Employing language models to generate explanations for an incoming implicit hate post is an active area of research. The explanation is intended to make explicit the underlying stereotype and aid content moderators. The training often combines top-k relevant knowledge graph (KG) tuples to provide world knowledge and improve performance on standard metrics. Interestingly, our study presents conflicting evidence for the role of the quality of KG tuples in generating implicit explanations. Consequently, simpler models incorporating external toxicity signals outperform KG-infused models. Compared to the KG-based setup, we observe a comparable performance for SBIC (LatentHatred) datasets with a performance variation of +0.44 (+0.49), +1.83 (-1.56), and -4.59 (+0.77) in BLEU, ROUGE-L, and BERTScore. Further human evaluation and error analysis reveal that our proposed setup produces more precise explanations than zero-shot GPT-3.5, highlighting the intricate nature of the task.

Paper Structure (25 sections, 5 equations, 4 figures, 13 tables, 2 algorithms)

Introduction
Related Work
Infusing Toxicity Attributes for Explaining Implicit Hate
Impact of Infusing Toxicity Attributes
Automated Evaluation.
Human Evaluation.
Ablation Study.
Impact of Toxicity Probabilities.
Impact of flipping Toxic Attributes.
Error Analysis.
Auditing the quality of KG tuples
Conclusion
Acknowledgements
Limitations
Ethical Considerations
...and 10 more sections

Figures (4)

Figure 1: A sample text (verbatim from SBIC) witnessing an improvement in toxicity and target detection when the incoming post is infused with implied context. We infer toxicity scores from the Unitary toxicity API and Toxigen-RoBERTa. For target detection, we prompt the ChatGPT user interface.
Figure 2: Workflow of our proposed system Tox-BART utilizing toxicity attributes (in-dataset and in-domain) for explaining implicit hate.
Figure 3: Analysis of top-k ($k=20$) KG tuples for SBIC and LatentHatred, capturing the spread of raw score values for (a) ConceptNet and (b) StereoKG, respectively. The x-axis represents the score value, either binned (for ConceptNet) or rounded to one decimal place (for StereoKG). Bins span [start, end), except for the last bin.
Figure 4: Analysis of top-k KG tuples retrieved for test samples of SBIC and LatentHatred at $k=20$, w.r.t. ConceptNet and StereoKG. As described in Section \ref{section:cos_quality_relevance}, for ConceptNet we evaluate the IDF-weighted relevance (Rel.) scores; for StereoKG, we evaluate the cosine similarity (Sim.) scores. Every y-axis captures the proportion of samples corresponding to the analysis at hand. Given that we look at the top $20$ tuples based on scores, (a) and (b) capture the spread of uniqueness in scores obtained per sample for ConceptNet and StereoKG, respectively. Here, the $i$th index on the x-axis is the number of unique scores out of $20$ present in a sample.
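
The two analyses described in the captions of Figures 3 and 4 can be sketched roughly as follows. This is an illustrative sketch, not the authors' code: the function names, the `ndigits` rounding used to compare floating-point scores, and the choice to treat the last bin as closed on both ends are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Sequence


def top_k(scores: Sequence[float], k: int = 20) -> List[float]:
    """Return the k highest scores in descending order."""
    return sorted(scores, reverse=True)[:k]


def bin_counts(scores: Sequence[float], edges: Sequence[float]) -> List[int]:
    """Count scores per bin.

    Bins are half-open [start, end); the last bin is assumed to be
    closed, [start, end], so the maximum edge value is not dropped.
    """
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for s in scores:
        for i in range(len(counts)):
            lo, hi = edges[i], edges[i + 1]
            if lo <= s < hi or (i == last and s == hi):
                counts[i] += 1
                break
    return counts


def unique_score_proportions(
    samples: Sequence[Sequence[float]], k: int = 20, ndigits: int = 6
) -> Dict[int, float]:
    """Map 'number of unique scores among the top-k' to the proportion of samples.

    Scores are rounded to `ndigits` (an assumption) before comparison so
    that tiny floating-point differences do not count as distinct scores.
    """
    tally = Counter(
        len({round(s, ndigits) for s in top_k(scores, k)}) for scores in samples
    )
    total = len(samples)
    return {n: c / total for n, c in sorted(tally.items())}
```

Here `samples` holds one list of retrieved-tuple scores per test post. A distribution concentrated at small keys would mean that many of the top-20 tuples for a post share the same score.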
