Directions in Abusive Language Training Data: Garbage In, Garbage Out
Bertie Vidgen, Leon Derczynski
TL;DR
This paper systematically reviews 63 publicly available datasets created to train abusive language classifiers and reports on the creation of a dedicated website for cataloguing abusive language data, hatespeechdata.com.
Abstract
Data-driven analysis and detection of abusive online content covers many different tasks, phenomena, contexts, and methodologies. This paper systematically reviews how abusive language datasets are created and what they contain, in conjunction with an open website for cataloguing abusive language data. Synthesizing this knowledge yields evidence-based recommendations for practitioners working with this complex and highly diverse data.
