If there's a Trigger Warning, then where's the Trigger? Investigating Trigger Warnings at the Passage Level
Matti Wiegmann, Jennifer Rakete, Magdalena Wolska, Benno Stein, Martin Potthast
TL;DR
This work investigates whether the passages that prompt a trigger warning can be located, constructing a dataset of 4,135 passages annotated for eight warnings and evaluating a range of classifiers on it. It demonstrates that trigger annotation is inherently subjective, with substantial annotator disagreement and positive rates that vary across warnings. The study systematically compares fine-tuned and few-shot models, including GPT-3.5/4, Mistral Mixtral, and Llama variants, in in-distribution and out-of-distribution settings. No single approach dominates, so model choice should be tailored to the specific warning and configuration. The findings indicate that automatic passage-level trigger classification is feasible, while underscoring the need for diverse training data and, potentially, personalization to improve generalization and moderation effectiveness.
Abstract
Trigger warnings are labels that preface documents with sensitive content if this content could be perceived as harmful by certain groups of readers. Since warnings about a document intuitively need to be shown before reading it, authors usually assign trigger warnings at the document level. What parts of their writing prompted them to assign a warning, however, remains unclear. We investigate for the first time the feasibility of identifying the triggering passages of a document, both manually and computationally. We create a dataset of 4,135 English passages, each annotated with one of eight common trigger warnings. In a large-scale evaluation, we then systematically assess the effectiveness and generalizability of fine-tuned and few-shot classifiers. We find that trigger annotation belongs to the group of subjective annotation tasks in NLP, and that automatic trigger classification remains challenging but feasible.
