Gendered Prompting and LLM Code Review: How Gender Cues in the Prompt Shape Code Quality and Evaluation

Lynn Janzen, Üveys Eroglu, Dorothea Kolossa, Pia Knöferle, Sebastian Möller, Vera Schmitt, Veronika Solopova

Abstract

LLMs are increasingly embedded in programming workflows, from code generation to automated code review. Yet, how gendered communication styles interact with LLM-assisted programming and code review remains underexplored. We present a mixed-methods pilot study examining whether gender-related linguistic differences in prompts influence code generation outcomes and code review decisions. Across three complementary studies, we analyze (i) collected real-world coding prompts, (ii) a controlled user study, in which developers solve identical programming tasks with LLM assistance, and (iii) an LLM-based simulated evaluation framework that systematically varies gender-coded prompt styles and reviewer personas. We find that gender-related differences in prompting style are subtle but measurable, with female-authored prompts exhibiting more indirect and involved language, which does not translate into consistent gaps in functional correctness or static code quality. For LLM code review, in contrast, we observe systematic biases: on average, models approve female-authored code more often, despite comparable quality. Controlled experiments show that gender-coded prompt styles affect code length and maintainability, while reviewer behavior varies across models. Our findings suggest that fairness risks in LLM-assisted programming arise less from generation accuracy than from LLM evaluation, as LLMs are increasingly deployed as automated code reviewers.

Paper Structure

This paper contains 38 sections, 14 figures, 14 tables.

Figures (14)

  • Figure 1: Illustration of the three studies.
  • Figure 2: Study I: average proportion of prompts of each request type, by gender. Whiskers indicate the range of data within 1.5 times the interquartile range from the lower and upper quartiles. Points outside this range are considered outliers.
  • Figure 3: Study III: code structure and style by prompt group and provider. Female-coded prompts yield higher maintainability, while male-coded and neutral prompts tend to yield higher Pylint scores.
  • Figure 4: Study III: unit-test pass rates by prompt group and provider. Pass rates are tightly clustered across gender-coded prompts, indicating no substantial differences in functional correctness ($H0_1$).
  • Figure 5: Study III: LLM reviewer approval rates by prompt group and provider backend across reviewer personas. Personas respond identically for Anthropic models. Provider differences are encoded by color, while reviewer personas are encoded by line style and marker. Despite identical functional correctness, approval behavior varies by provider and interacts with prompt style across reviewer personas.
  • ...and 9 more figures