ToxSyn: Reducing Bias in Hate Speech Detection via Synthetic Minority Data in Brazilian Portuguese
Iago Alves Brito, Julia Soares Dollis, Fernanda Bufon Färber, Diogo Fernandes Costa Silva, Arlindo Rodrigues Galvão Filho
TL;DR
ToxSyn introduces the first large-scale Portuguese, multi-label hate-speech corpus spanning nine protected groups, generated through a controllable four-stage pipeline to balance toxicity, group targets, and discourse types while including neutral content. The study demonstrates severe domain dependency in toxicity detection: models trained on social-media data fail to generalize to minority-targeted text, and vice versa, challenging the notion that a single model can master toxicity across domains. By providing fine-grained annotations and a diversified neutral set, ToxSyn enables minority-aware evaluation and reveals that Macro F1 can mask failures, underscoring the need for domain- and target-specific benchmarks. The work offers a reproducible blueprint for synthetic-data generation in low- and mid-resource languages and releases the dataset publicly to spur progress in equitable online safety tools.
Abstract
The development of robust hate speech detection systems remains limited by the lack of large-scale, fine-grained training data, especially for languages other than English. Existing corpora typically rely on coarse toxic/non-toxic labels, and the few that capture hate directed at specific minority groups lack the non-toxic counterexamples (i.e., benign text about minorities) required to distinguish genuine hate from mere discussion. We introduce ToxSyn, the first large-scale Portuguese corpus explicitly designed for multi-label hate speech detection across nine protected minority groups. Generated via a controllable four-stage pipeline, ToxSyn includes discourse-type annotations that capture the rhetorical strategies of toxic language, such as sarcasm or dehumanization. Crucially, it systematically includes the non-toxic counterexamples absent from all other public datasets. Our experiments reveal a catastrophic, mutual generalization failure between social-media domains and ToxSyn: models trained on social media struggle to generalize to minority-specific contexts, and vice versa. This finding indicates that the two settings are distinct tasks, and it shows that summary metrics such as Macro F1 can be unreliable indicators of true model behavior, since they can completely mask model failure. We publicly release ToxSyn on Hugging Face to foster reproducible research on synthetic data generation and to benchmark progress in hate-speech detection for low- and mid-resource languages.
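To make the point about Macro F1 concrete, the following is a minimal sketch on invented toy data, not on ToxSyn or on any result from the paper. The group names `group_a` and `group_b`, the sample counts, and the helper functions are all assumptions chosen for illustration. It shows one way an aggregate Macro F1 can look healthy while the toxic class is missed entirely for one target group.

```python
from collections import defaultdict


def f1_for_class(y_true, y_pred, cls):
    """F1 for a single class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Hypothetical toy rows: (target group, true label, predicted label),
# where 1 = toxic and 0 = non-toxic. These numbers are invented.
rows = (
    [("group_a", 1, 1)] * 40    # toxic text about group_a, all detected
    + [("group_a", 0, 0)] * 40  # benign text about group_a, all correct
    + [("group_b", 1, 0)] * 10  # toxic text about group_b, all missed
    + [("group_b", 0, 0)] * 10  # benign text about group_b, all correct
)

y_true = [t for _, t, _ in rows]
y_pred = [p for _, _, p in rows]
overall = macro_f1(y_true, y_pred)

# Break the toxic-class F1 down by target group.
by_group = defaultdict(lambda: ([], []))
for group, t, p in rows:
    by_group[group][0].append(t)
    by_group[group][1].append(p)

toxic_f1_by_group = {
    group: f1_for_class(truth, pred, cls=1)
    for group, (truth, pred) in by_group.items()
}

print(f"overall Macro F1: {overall:.3f}")  # 0.899
for group, score in sorted(toxic_f1_by_group.items()):
    print(f"toxic F1 for {group}: {score:.3f}")  # group_a 1.000, group_b 0.000
```

In this toy setup the aggregate score is about 0.90 even though no toxic example aimed at `group_b` is detected, which is the kind of failure that only a per-group breakdown makes visible.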
