If there's a Trigger Warning, then where's the Trigger? Investigating Trigger Warnings at the Passage Level

Matti Wiegmann, Jennifer Rakete, Magdalena Wolska, Benno Stein, Martin Potthast

TL;DR

This work investigates locating trigger warnings at the passage level by constructing a 4,135-passage dataset annotated for eight warnings and evaluating a range of classifiers. It demonstrates that trigger annotation is inherently subjective, with substantial annotator disagreement and variable positive rates across warnings. The study systematically compares fine-tuned and few-shot models, including GPT-3.5/4, Mistral Mixtral, and Llama variants, across in-distribution and out-of-distribution settings, showing that no single approach dominates and that model choice should be tailored to the specific warning and configuration. The findings highlight the feasibility of automatic passage-level trigger classification while underscoring the need for diverse training data and potential personalization to improve generalization and moderation effectiveness.

Abstract

Trigger warnings are labels that preface documents with sensitive content if this content could be perceived as harmful by certain groups of readers. Since warnings about a document intuitively need to be shown before reading it, authors usually assign trigger warnings at the document level. What parts of their writing prompted them to assign a warning, however, remains unclear. We investigate for the first time the feasibility of identifying the triggering passages of a document, both manually and computationally. We create a dataset of 4,135 English passages, each annotated with one of eight common trigger warnings. In a large-scale evaluation, we then systematically evaluate the effectiveness of fine-tuned and few-shot classifiers, and their generalizability. We find that trigger annotation belongs to the group of subjective annotation tasks in NLP, and that automatic trigger classification remains challenging but feasible.

Paper Structure (45 sections, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Two example passages from fan fiction stories annotated for the trigger warning Death. The upper example was unanimously annotated as 'positive', the lower as 'negative'. The central sentence in italics was retrieved via keywords; preceding and following sentences serve as context. Figure \ref{table-appendix-passage-examples} shows more examples.
  • Figure 2: Warning Death. Structure of both the annotation instructions and the prompts used in the few-shot experiments. Shown here is an example passage for the Death warning. The italicized sections vary by instance.
  • Figure 3: Trigger Warning. Selected example passages with assigned warning and number of positive votes. The center sentence that was retrieved via keywords is in italics.
  • Figure 4: Overview of the experimental results across distribution (in- vs. out-of-distribution) and vote aggregation (minority vs. majority). (a) The bar charts show the mean accuracy for the models across all warnings with 95% t-estimated confidence intervals. The box plots show the quartiles and outliers of all folds and warnings. (b) The bar charts show the mean accuracy for the warnings across all models with 95% t-estimated confidence intervals. The upper and lower ticks show the best and worst models' performance. Table \ref{table-appendix-classification-results} (Appendix \ref{appendix-a}) shows the full results.
  • Figure 5: Trigger Warning. Selected example passages that were always misclassified by all models.
  • ...and 1 more figure