Sociocultural Considerations in Monitoring Anti-LGBTQ+ Content on Social Media

Sidney G. -J. Wong

TL;DR

This paper investigates how sociocultural factors shape hate-speech detection systems for anti-LGBTQ+ content on social media. It compares two open-source datasets (mlma and ltedi), trains transformer-based detectors on them, and applies the detectors to georeferenced samples across national varieties of English to examine cross-cultural performance. The findings show that the sociolinguistic alignment of the training data biases model outputs, and that keyword-based data collection can lead models to overfit on slurs, potentially missing anti-LGBTQ+ content that contains no slurs; domain adaptation and qualitative interpretation are therefore necessary. The study highlights the need to integrate sociolinguistic context into both data collection and model evaluation to produce fit-for-purpose monitoring tools, and it informs future work toward more inclusive, context-aware hate-speech detection.

Abstract

The purpose of this paper is to ascertain the influence of sociocultural factors (i.e., social, cultural, and political) on the development of hate speech detection systems. We set out to investigate the suitability of using open-source training data to monitor levels of anti-LGBTQ+ content on social media across different national varieties of English. Our findings suggest that the social and cultural alignment of open-source hate speech data sets influences the predicted outputs. Furthermore, the keyword-search approach based on anti-LGBTQ+ slurs, used in the development of open-source training data, encourages detection models to overfit on slurs; as a result, anti-LGBTQ+ content may go undetected. We recommend combining empirical outputs with qualitative insights to ensure these systems are fit for purpose.

Paper Structure (19 sections, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Model comparison of anti-LGBTQ+ hate speech detected in ten randomised samples of 10,000 posts/tweets per month from India between June 2018 and June 2023, including the grouped mean and the upper and lower confidence intervals.
  • Figure 2: Comparison of anti-LGBTQ+ hate speech detected in samples of 10,000 posts/tweets from inner- and outer-circle varieties of English between June 2018 and June 2023, including the grouped mean and the upper and lower confidence intervals.
  • Figure 3: Quarterly growth rate of anti-LGBTQ+ hate speech detected with the ltedi model, together with the number of posts/tweets by country, between June 2018 and June 2023.
  • Figure 4: mlma Wordcloud.
  • Figure 5: ltedi Wordcloud.