DefVerify: Do Hate Speech Models Reflect Their Dataset's Definition?
Urja Khurana, Eric Nalisnick, Antske Fokkens
TL;DR
DefVerify asks whether hate speech models faithfully reflect the definitions of the datasets used to train them. It introduces a three-step framework that encodes definitions via Hate Speech Criteria, builds an enriched diagnostic set from HateCheck, and assesses alignment while diagnosing where misalignment originates; the approach is demonstrated on six widely used English datasets. The results reveal substantial gaps between dataset definitions and model behavior, and DefVerify offers a practical protocol for diagnosing biases, annotation issues, and generalization problems before deployment. The work highlights the importance of explicit, public, and well-validated definitions and diagnostics for safety-critical NLP systems, and offers a foundation for extending the methodology to multilingual and legally informed contexts.
Abstract
When building a predictive model, it is often difficult to ensure that application-specific requirements are encoded by the model that will eventually be deployed. Consider researchers working on hate speech detection. They will have an idea of what is considered hate speech, but building a model that reflects their view accurately requires preserving those ideals throughout the workflow of dataset construction and model training. Complications such as sampling bias, annotation bias, and model misspecification almost always arise, possibly resulting in a gap between the application specification and the model's actual behavior upon deployment. To address this issue for hate speech detection, we propose DefVerify: a three-step procedure that (i) encodes a user-specified definition of hate speech, (ii) quantifies to what extent the model reflects the intended definition, and (iii) tries to identify the point of failure in the workflow. Applying DefVerify to six popular hate speech benchmark datasets, we find gaps between definition and model behavior.
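As a rough illustration of steps (i) and (ii), the sketch below encodes a definition as a mapping from criteria to expected labels and measures how often a model agrees with it on diagnostic cases. It is not the authors' implementation: the criterion names, the placeholder test cases, the `toy_predict` classifier, and the `alignment_per_criterion` helper are all hypothetical and chosen only for illustration. Step (iii), locating the point of failure, is not covered.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Step (i), hypothetical encoding: each criterion maps to the label
# that the dataset's definition implies for cases of that kind.
definition: Dict[str, str] = {
    "threat_against_protected_group": "hateful",
    "reclaimed_slur": "non-hateful",
}

# Hypothetical diagnostic cases as (text, criterion) pairs.
# The texts are neutral placeholders, not real examples.
cases: List[Tuple[str, str]] = [
    ("placeholder explicit threat", "threat_against_protected_group"),
    ("placeholder implicit statement", "threat_against_protected_group"),
    ("placeholder reclaimed usage", "reclaimed_slur"),
]


def toy_predict(text: str) -> str:
    """Stand-in for a trained model: a trivial keyword rule (assumption)."""
    return "hateful" if "threat" in text.lower() else "non-hateful"


def alignment_per_criterion(
    predict: Callable[[str], str],
    definition: Dict[str, str],
    cases: List[Tuple[str, str]],
) -> Dict[str, float]:
    """Step (ii): fraction of cases per criterion where the model's
    prediction matches the label implied by the definition."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for text, criterion in cases:
        if criterion not in definition:
            continue  # the definition is silent on this criterion
        totals[criterion] += 1
        if predict(text) == definition[criterion]:
            hits[criterion] += 1
    return {c: hits[c] / totals[c] for c in totals}


scores = alignment_per_criterion(toy_predict, definition, cases)
for criterion, score in scores.items():
    print(f"{criterion}: {score:.2f}")
```

Reporting agreement per criterion rather than as a single aggregate makes it visible which parts of a definition a model fails to reflect, which is the kind of gap the abstract describes.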
