Prompt and Prejudice

Lorenzo Berlincioni, Luca Cultrera, Federico Becattini, Marco Bertini, Alberto Del Bimbo

TL;DR

This study investigates how injecting first names into prompts affects ethical judgments produced by Large Language Models and Vision-Language Models. It combines ETHICS, a large social-science–designed ethical benchmark, with PSB, a practical benchmark of real-world decision scenarios, applying over 300 names across diverse demographics to thousands of queries. The authors develop an evaluation framework using Accuracy and Goodness metrics to reveal gender and ethnolinguistic biases, demonstrating that demographic signals can skew model outputs in high-stakes contexts. The findings underscore the need for systematic auditing and bias-mitigation strategies to ensure fair and reliable AI-assisted decision-making in real-world applications.

Abstract

This paper investigates the impact of first names in prompts given to Large Language Models (LLMs) and Vision Language Models (VLMs), particularly for ethical decision-making tasks. We propose an approach that appends first names to ethically annotated text scenarios to reveal demographic biases in model outputs. Our study involves a curated list of more than 300 names representing diverse genders and ethnic backgrounds, tested across thousands of moral scenarios. Following auditing methodologies from the social sciences, we propose a detailed analysis of popular LLMs/VLMs. The analysis contributes to the field of responsible AI by emphasizing the importance of recognizing and mitigating biases in these systems. Furthermore, we introduce a novel benchmark, the Practical Scenarios Benchmark (PSB), designed to assess gender and demographic biases in everyday decision-making scenarios as well as in practical scenarios where an LLM might be used to make sensible decisions (e.g., granting mortgages or insurance). This benchmark allows for a comprehensive comparison of model behaviors across different demographic categories, highlighting the risks and biases that may arise in practical applications of LLMs and VLMs.
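
The name-injection audit described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the prompt template, the two-entry name list, the group labels, the toy scenarios, and the `dummy_model` stand-in are all hypothetical and chosen only to show how per-group accuracy might be compared.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical name list: (first name, group label). The paper uses 300+ names;
# these two entries are placeholders for illustration only.
NAMES: List[Tuple[str, str]] = [("Alice", "female"), ("Bob", "male")]

# Hypothetical scenarios: (text, label), label 1 = acceptable, 0 = unacceptable.
SCENARIOS: List[Tuple[str, int]] = [
    ("I returned the wallet I found to its owner.", 1),
    ("I took money from the tip jar when nobody was looking.", 0),
]


def build_prompt(scenario: str, name: str) -> str:
    """Append a first name to the scenario text (assumed template)."""
    return f"{scenario}\nWritten by: {name}\nIs this acceptable? Answer yes or no."


def accuracy_by_group(
    names: List[Tuple[str, str]],
    scenarios: List[Tuple[str, int]],
    ask_model: Callable[[str], int],
) -> Dict[str, float]:
    """Query the model for every (name, scenario) pair; return accuracy per group."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for name, group in names:
        for text, label in scenarios:
            prediction = ask_model(build_prompt(text, name))
            correct[group] += int(prediction == label)
            total[group] += 1
    return {group: correct[group] / total[group] for group in total}


def dummy_model(prompt: str) -> int:
    """Stand-in for a real LLM call; deliberately biased to produce a gap."""
    if "Alice" in prompt:
        return 1  # always answers "acceptable" for this name
    return 1 if "returned the wallet" in prompt else 0


if __name__ == "__main__":
    print(accuracy_by_group(NAMES, SCENARIOS, dummy_model))
```

In a real audit, `dummy_model` would be replaced by a call to the model under test, and a gap between groups on otherwise identical scenarios would be the signal of interest.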

Paper Structure (20 sections, 2 figures, 10 tables)

Figures (2)

  • Figure 1: Per-model gender distance of Goodness from the average on the ETHICS subtasks.
  • Figure 2: VLM pipeline illustration. After generating a portrait corresponding to a prompt with gender and ethnicity information, we use the image along with the text scenarios from either ETHICS or the Practical Scenarios Benchmark to test LLaVA.