Tell Me What You Know About Sexism: Expert-LLM Interaction Strategies and Co-Created Definitions for Zero-Shot Sexism Detection

Myrthe Reuver, Indira Sen, Matteo Melis, Gabriella Lapesa

TL;DR

This work addresses the challenge of detecting sexism by aligning domain expert knowledge with large language models through a four-part pipeline that combines survey data, interactive expert–LLM sessions, co-created definitions, and zero-shot classification. The authors collect 27 definitions (expert-written, LLM-generated, and co-created) from nine sexism researchers and test their impact on 2,500 texts drawn from five sexism benchmarks using GPT4o, yielding 67,500 classifications. On average, LLM-generated definitions improve classification performance while expert-written definitions underperform, though co-created definitions yield context-specific gains for certain experts and datasets, revealing nuanced interactions between human expertise, prompting strategies, and model behavior. The study contributes a methodological framework, a dataset of expert–LLM interactions, and empirical evidence on how hybrid intelligence can be leveraged for social-construct detection, while acknowledging limitations such as sample size, language, and reliance on proprietary models. Overall, the research informs how expert-guided prompting and co-creation can influence zero-shot sexism detection and highlights the variability across datasets and individual experts, guiding future work toward more robust and generalizable hybrid approaches.

Abstract

This paper investigates hybrid intelligence and collaboration between researchers of sexism and Large Language Models (LLMs), using a four-component pipeline. First, nine sexism researchers answer questions about their knowledge of sexism and of LLMs. They then participate in two interactive experiments involving an LLM (GPT3.5). In the first experiment, experts assess the model's knowledge about sexism and its suitability for use in research. The second experiment tasks them with creating three different definitions of sexism: an expert-written definition, an LLM-written one, and a co-created definition. Lastly, zero-shot classification experiments use the three definitions from each expert in a prompt template for sexism detection, evaluating GPT4o on 2,500 texts sampled from five sexism benchmarks. We then analyze the resulting 67,500 classification decisions. The LLM interactions lead to longer and more complex definitions of sexism. Expert-written definitions on average perform poorly compared to LLM-generated definitions. However, some experts do improve classification performance with their co-created definitions of sexism, including experts who are inexperienced in using LLMs.
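
The classification setup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the template wording and the names `TEMPLATE`, `build_prompt`, `count_decisions`, and `macro_f1` are assumptions introduced here, and the paper's actual prompt is not reproduced. The sketch shows how a definition is slotted into a zero-shot prompt, how nine experts with three definitions each over 2,500 texts give 67,500 decisions, and how macro F1 averages the per-class F1 scores.

```python
# Hypothetical prompt template; the wording used in the paper is not shown here.
TEMPLATE = (
    "Definition of sexism: {definition}\n"
    "Text: {text}\n"
    "Is this text sexist according to the definition? Answer 'yes' or 'no'."
)

# The three definition types each expert produces (from the abstract).
DEFINITION_TYPES = ("expert-written", "LLM-written", "co-created")


def build_prompt(definition: str, text: str) -> str:
    """Insert one sexism definition and one text into the zero-shot template."""
    return TEMPLATE.format(definition=definition, text=text)


def count_decisions(n_experts: int, n_texts: int,
                    n_definition_types: int = len(DEFINITION_TYPES)) -> int:
    """Every definition is applied to every text, so the counts multiply."""
    return n_experts * n_definition_types * n_texts


def macro_f1(gold, pred, labels=(0, 1)) -> float:
    """Unweighted mean of per-class F1 over the given labels."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # By convention here, a class with no gold or predicted items scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(count_decisions(n_experts=9, n_texts=2500))
    print(build_prompt("<one of the 27 definitions>", "<one benchmark text>"))
```

Macro averaging weights the sexist and non-sexist classes equally, which matters when the benchmark samples are imbalanced.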

Paper Structure

This paper contains 72 sections, 14 figures, 8 tables.

Figures (14)

  • Figure 1: Experts participate in a survey (part I) as well as two interactive experiments (part II and III), after which we perform zero-shot classification experiments (part IV) with two LLMs, using the sexism definitions created during the interaction experiments.
  • Figure 2: Explanation of the interactive experiments of Part II and Part III of our pipeline.
  • Figure 3: $F1$ (macro) performance of GPT4o per participant over the three definitions (upper plot) and over the five datasets (bottom row).
  • Figure 4: Heatmap of Likert-scale ratings of participants' experience with LLMs.
  • Figure 5: Heatmap of Likert-scale ratings of self-reported experience in sexism research.
  • ...and 9 more figures