Detecting Linguistic Indicators for Stereotype Assessment with Large Language Models

Rebekka Görge; Michael Mock; Héctor Allende-Cid

TL;DR

This work tackles stereotype-related data bias in large language models by grounding stereotype signals in sociolinguistic theory via the Social Category and Stereotype Communication (SCSC) framework. It develops a fixed categorization scheme of linguistic indicators for category labels and their content, and uses in-context learning with multiple LLMs to detect and classify these indicators at the sentence level. A linear regression-based weighting scheme maps the indicators to a continuous stereotype strength score, enabling fine-grained, interpretable quantification and comparison with human rankings. Evaluations on CrowS-Pairs show that larger models achieve higher accuracy and that the approach yields an interpretable scoring function. Challenges remain in handling connotation, in generalization, and in scaling beyond sentence-level analysis. The method offers a scalable, interpretable tool for auditing stereotypes in text data and model outputs, with potential uses in pre-filtering and bias analysis, and it points to future work on incorporating broader context and sentiment dimensions.

Abstract

Social categories and stereotypes are embedded in language and can introduce data bias into Large Language Models (LLMs). Despite safeguards, these biases often persist in model behavior, potentially leading to representational harm in outputs. While sociolinguistic research provides valuable insights into the formation of stereotypes, NLP approaches for stereotype detection rarely draw on this foundation and often lack objectivity, precision, and interpretability. To fill this gap, we propose a new approach that detects and quantifies the linguistic indicators of stereotypes in a sentence. We derive linguistic indicators from the Social Category and Stereotype Communication (SCSC) framework that signal strong social category formulation and stereotyping in language, and use them to build a categorization scheme. To automate this approach, we instruct different LLMs using in-context learning to apply the scheme to a sentence: the LLM examines the sentence's linguistic properties and provides a basis for a fine-grained assessment. Based on an empirical evaluation of the importance of the different linguistic indicators, we learn a scoring function that measures the linguistic indicators of a stereotype. Our annotations of stereotyped sentences show that these indicators are present in the sentences and explain the strength of a stereotype. In terms of model performance, our results show that the models generally perform well in detecting and classifying the linguistic indicators of category labels used to denote a category, but sometimes struggle to correctly evaluate the associated behaviors and characteristics. Using more few-shot examples within the prompts significantly improves performance. Model performance increases with size: Llama-3.3-70B-Instruct and GPT-4 achieve comparable results that surpass those of Mixtral-8x7B-Instruct, GPT-4-mini and Llama-3.1-8B-Instruct.
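To make the idea of a learned scoring function concrete, the following minimal sketch fits a linear regression that maps binary indicator annotations to a stereotype strength score. It is an illustration under stated assumptions, not the authors' implementation: the indicator names, the toy annotations, the toy ratings, and the helper functions `fit_weights` and `stereotype_score` are all hypothetical, and the ratings are noise-free so that the fit is easy to inspect.

```python
import numpy as np

# Hypothetical indicator names, chosen for illustration only; they are NOT
# the categorization scheme defined in the paper.
INDICATORS = ["generic_label", "noun_label", "trait_description", "generalizing_quantifier"]

def fit_weights(X, y):
    """Fit ordinary least squares with an intercept; return (weights, bias)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Append a column of ones so the last coefficient acts as the intercept.
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]

def stereotype_score(features, weights, bias):
    """Weighted sum of the indicators present in one sentence, plus the intercept."""
    return float(np.dot(np.asarray(features, dtype=float), weights) + bias)

# Toy annotations: one row per sentence, 1 if the indicator is present.
X = [
    [1, 1, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]
# Toy "human" strength ratings (made up, noise-free) for the same sentences.
y = [0.95, 0.60, 0.25, 0.05, 0.70, 0.45]

weights, bias = fit_weights(X, y)
for name, w in zip(INDICATORS, weights):
    print(f"{name}: {w:.2f}")

# Score a new sentence annotated with three of the four indicators.
score = stereotype_score([1, 0, 1, 1], weights, bias)
print(f"score: {score:.2f}")
```

Because each learned weight is attached to a single named indicator, the resulting score can be decomposed into per-indicator contributions, which is the sense in which such a scoring function is interpretable.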
