Do LLMs have a Gender (Entropy) Bias?

Sonal Prabhune, Balaji Padmanabhan, Kaushik Dutta

TL;DR

This work defines entropy bias as a disparity in the information content of LLM outputs across gendered prompts and investigates its presence in real-world business and health questions using the RealWorldQuestioning benchmark. The study combines quantitative measures (Shannon entropy, CTTR, Maas) with LLM-as-judge evaluations to assess gender-related differences across four domains and multiple models. It finds no strong category-level bias but observes question-level disparities that can offset one another in aggregate; a simple, prompt-based debiasing method that merges gendered responses often yields higher information content than either variant. The results highlight both the feasibility of lightweight debiasing and the importance of using realistic, real-world data and human-in-the-loop evaluation to ensure fair and informative AI assistance in practical settings.
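To make the three quantitative measures concrete, the following is a minimal sketch of how they could be computed for a single response. It is not the authors' implementation: the regex tokenizer, the word-level (rather than character-level) entropy, and the natural-log base in the Maas index are all assumptions, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase word tokens; the paper may tokenize differently.
    return re.findall(r"[a-z0-9']+", text.lower())


def shannon_entropy(tokens):
    # H = -sum(p * log2(p)) over the empirical token distribution, in bits.
    if not tokens:
        return 0.0
    total = len(tokens)
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def cttr(tokens):
    # Corrected type-token ratio: unique tokens / sqrt(2 * total tokens).
    if not tokens:
        return 0.0
    return len(set(tokens)) / math.sqrt(2 * len(tokens))


def maas(tokens):
    # Maas index: (log N - log V) / (log N)^2; lower values mean richer vocabulary.
    # Natural log is an assumption; the index is undefined for fewer than 2 tokens.
    n, v = len(tokens), len(set(tokens))
    if n < 2:
        raise ValueError("Maas index needs at least two tokens")
    return (math.log(n) - math.log(v)) / (math.log(n) ** 2)


if __name__ == "__main__":
    tokens = tokenize("Save early, invest often, and review your budget often.")
    print(shannon_entropy(tokens), cttr(tokens), maas(tokens))
```

Comparing these values for the male-prompted and female-prompted responses to the same question gives one per-question difference for each metric.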

Abstract

We investigate the existence and persistence of a specific type of gender bias in several popular LLMs and contribute a new benchmark dataset, RealWorldQuestioning (released on HuggingFace), developed from real-world questions across four key domains in business and health contexts: education, jobs, personal financial management, and general health. We define and study entropy bias: a discrepancy in the amount of information an LLM generates in response to real questions users have asked. We tested this using four different LLMs and evaluated the generated responses both qualitatively and quantitatively, using ChatGPT-4o as an "LLM-as-judge". Our analyses (metric-based comparisons and "LLM-as-judge" evaluation) suggest that there is no significant bias in LLM responses for men and women at the category level. However, at a finer granularity (the individual question level), LLM responses for men and women differ substantially in the majority of cases; these differences often "cancel" each other out because some responses are better for men and others are better for women. This is still a concern, since typical users of these tools often ask only one specific question rather than several varied ones in each of these common yet important areas of life. We suggest a simple debiasing approach that iteratively merges the responses for the two genders to produce a final result. This simple, prompt-based strategy can effectively debias LLM outputs, producing responses with higher information content than both gendered variants in 78% of the cases and consistently achieving a balanced integration in the remaining cases.
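The "cancelling" effect described in the abstract can be illustrated with a small sketch. The per-question differences below are invented toy numbers, not results from the paper, and the function names are hypothetical; the point is only that a signed average can sit at zero while the typical per-question gap is large.

```python
from statistics import mean


def category_level_gap(diffs):
    # Signed mean of per-question differences (male score minus female score).
    # Positive and negative differences offset each other here.
    return mean(diffs)


def question_level_gap(diffs):
    # Mean absolute difference: how far apart the two responses are
    # for a typical single question, regardless of which gender is favored.
    return mean(abs(d) for d in diffs)


if __name__ == "__main__":
    # Toy, made-up entropy differences for four questions in one category.
    toy_diffs = [0.5, -0.5, 0.25, -0.25]
    print("category-level gap:", category_level_gap(toy_diffs))
    print("question-level gap:", question_level_gap(toy_diffs))
```

A user who asks only one of these questions experiences the per-question gap, not the category average, which is why the abstract treats the question-level result as a concern.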

Paper Structure

This paper contains 65 sections, 8 equations, 8 figures, 7 tables, 7 algorithms.

Figures (8)

  • Figure 1: Example of Bias in Education Recommendation
  • Figure 2: Example of Bias in Job Recommendation
  • Figure 3: Example of Bias in Investment Recommendation
  • Figure 4: Example of Bias in Health Recommendation
  • Figure 5: Example of Bias in Education Recommendation
  • ...and 3 more figures