HEARTS: A Holistic Framework for Explainable, Sustainable and Robust Text Stereotype Detection

Theo King, Zekun Wu, Adriano Koshiyama, Emre Kazim, Philip Treleaven

TL;DR

HEARTS tackles stereotype detection in text by combining an expanded, more diverse dataset (EMGSD), a carbon-efficient classifier (ALBERT-V2), and token-level explanations from SHAP and LIME. The Expanded Multi-Grain Stereotype Dataset (EMGSD) covers six demographic groups, and BERT models fine-tuned on it outperform those trained on its individual components. The approach emphasizes transparency through token-level importance rankings and explanation confidence scores, and extends the analysis to quantify stereotypical bias in the outputs of a range of LLMs. Together, these contributions support trustworthy stereotype detection, with practical implications for responsible AI deployment and bias assessment across models.

Abstract

Stereotypes are generalised assumptions about societal groups, and even state-of-the-art LLMs using in-context learning struggle to identify them accurately. Due to the subjective nature of stereotypes, where what constitutes a stereotype can vary widely depending on cultural, social, and individual perspectives, robust explainability is crucial. Explainable models ensure that these nuanced judgments can be understood and validated by human users, promoting trust and accountability. We address these challenges by introducing HEARTS (Holistic Framework for Explainable, Sustainable, and Robust Text Stereotype Detection), a framework that enhances model performance, minimises carbon footprint, and provides transparent, interpretable explanations. We establish the Expanded Multi-Grain Stereotype Dataset (EMGSD), comprising 57,201 labelled texts across six groups, including under-represented demographics like LGBTQ+ and regional stereotypes. Ablation studies confirm that BERT models fine-tuned on EMGSD outperform those trained on individual components. We then analyse a fine-tuned, carbon-efficient ALBERT-V2 model using SHAP to generate token-level importance values, ensuring alignment with human understanding, and calculate explainability confidence scores by comparing SHAP and LIME outputs...
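
The abstract states only that explainability confidence scores come from "comparing SHAP and LIME outputs"; the exact metric is not given in this excerpt. The sketch below assumes cosine similarity between the two per-token attribution vectors as one plausible agreement measure, and also compares the token rankings they induce. The function names, token placeholders, and attribution values are hypothetical and chosen purely for illustration; they are not results from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length attribution vectors."""
    if len(a) != len(b):
        raise ValueError("attribution vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Undefined for an all-zero vector; treated here as "no agreement".
        return 0.0
    return dot / (norm_a * norm_b)


def rank_tokens(tokens, scores):
    """Tokens ordered from most to least important by absolute attribution."""
    order = sorted(range(len(tokens)), key=lambda i: abs(scores[i]), reverse=True)
    return [tokens[i] for i in order]


# Hypothetical attributions for one four-token sentence (made-up numbers).
tokens = ["w1", "w2", "w3", "w4"]
shap_scores = [0.01, 0.40, 0.02, 0.30]  # assumed SHAP values per token
lime_scores = [0.00, 0.35, 0.05, 0.33]  # assumed LIME weights per token

agreement = cosine_similarity(shap_scores, lime_scores)
shap_ranking = rank_tokens(tokens, shap_scores)
lime_ranking = rank_tokens(tokens, lime_scores)
```

Under this assumption, a similarity near 1 together with matching rankings would indicate that the two explainers agree on which tokens drive the prediction, while a low value would flag an explanation that deserves less confidence.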

Paper Structure

This paper contains 19 sections, 10 equations, 28 figures, 17 tables.

Figures (28)

  • Figure 1: Overview of the dataset filtering and augmentation process for the WinoQueer and SeeGULL datasets. The WinoQueer dataset (91,080 sentences) undergoes filtering by removing duplicates, counterfactual statements, and overtly negative sentences, resulting in a refined set of 1,088 sentences. The SeeGULL dataset (6,781 phrases) is filtered to remove non-offensive and non-stereotypical sentences, yielding 690 phrases. Sentence generation using Mistral Medium expands these phrases to 690 sentences. Both filtered datasets are then augmented using GPT-4 to generate three categories: neutral, stereotypical, and unrelated sentences, contributing a total of 5,334 additional observations to the MGSD.
  • Figure 2: Evolution of test set F1 score by text length for ALBERT-V2 model trained on EMGSD. Scores are calculated by taking mean F1 score for sentences of a given text length in EMGSD test data, for all text lengths where at least 10 samples can be drawn.
  • Figure 3: Comparison of SHAP and LIME token rankings for correct model prediction, indicating close alignment.
  • Figure 4: Comparison of SHAP and LIME token rankings for incorrect model prediction, indicating divergent outcomes.
  • Figure 5: Stereotype prevalence in LLM outputs by model release date. Stemmed text instances from the EMGSD test set (neutral prompts) are used to elicit 1,050 responses per model.
  • ...and 23 more figures
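
The Figure 2 caption describes scoring F1 separately for each text length and keeping only lengths with at least 10 samples. A minimal sketch of that grouping follows, under two assumptions not stated in the caption: that the per-length score is macro-averaged F1 over the labels present in each group, and that each record is a hypothetical `(length, true_label, predicted_label)` tuple. It is an illustration of the idea, not the authors' evaluation code.

```python
from collections import defaultdict


def f1_for_label(y_true, y_pred, label):
    """One-vs-rest F1 for a single label; 0.0 when the label never occurs."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


def f1_by_length(records, min_samples=10):
    """Map each text length to its F1, skipping lengths with too few samples.

    `records` is an iterable of (length, true_label, predicted_label) tuples.
    """
    buckets = defaultdict(list)
    for length, true_label, pred_label in records:
        buckets[length].append((true_label, pred_label))

    scores = {}
    for length, pairs in sorted(buckets.items()):
        if len(pairs) < min_samples:
            continue  # too few samples for a stable estimate at this length
        y_true = [t for t, _ in pairs]
        y_pred = [p for _, p in pairs]
        scores[length] = macro_f1(y_true, y_pred)
    return scores
```

The `min_samples` threshold mirrors the caption's "at least 10 samples" condition; plotting the returned mapping against length would give a curve of the kind the figure describes.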