Are We Aligned? A Preliminary Investigation of the Alignment of Responsible AI Values between LLMs and Human Judgment

Asma Yamani, Malak Baslyman, Moataz Ahmed

TL;DR

The paper investigates how closely LLMs’ judgments on responsible AI values align with human judgments from AI practitioners and a US-representative sample across three standard tasks and a novel requirements-prioritization task. It evaluates 23 LLMs, filtering to 15 robust models for T1–T3 and 8 for T4, and uses Spearman correlations to quantify alignment against human baselines. Findings show stronger alignment with AI practitioners than with the US-representative sample, particularly on core values like fairness, privacy, transparency, safety, and accountability, but reveal faithfulness gaps when abstract values are translated into concrete software requirements. The study highlights the practical risks of relying solely on LLMs in requirements engineering and argues for systematic benchmarks, interpretation, and human oversight to keep AI-assisted software development aligned with human values.
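As a rough illustration of the alignment measure mentioned above, the following minimal sketch computes a Spearman rank correlation between an LLM's value-selection rates and a human baseline. The function names (`average_ranks`, `spearman`) and all numbers are hypothetical placeholders chosen for illustration; they are not the paper's code or data, and the paper's exact procedure may differ.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical selection rates per value (illustrative only, not from the paper).
values = ["fairness", "privacy", "transparency", "safety", "accountability"]
llm_rates = [0.90, 0.80, 0.70, 0.60, 0.50]
human_rates = [0.85, 0.75, 0.55, 0.65, 0.40]

rho = spearman(llm_rates, human_rates)
print(f"Spearman rho = {rho:.2f}")
```

A coefficient near 1 means the LLM and the human group order the values similarly, even if the absolute selection rates differ; that is why a rank-based measure suits this kind of comparison.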

Abstract

Large Language Models (LLMs) are increasingly employed in software engineering tasks such as requirements elicitation, design, and evaluation, raising critical questions regarding their alignment with human judgments on responsible AI values. This study investigates how closely LLMs' value preferences align with those of two human groups: a US-representative sample and AI practitioners. We evaluate 23 LLMs across four tasks: (T1) selecting key responsible AI values, (T2) rating their importance in specific contexts, (T3) resolving trade-offs between competing values, and (T4) prioritizing software requirements that embody those values. The results show that LLMs generally align more closely with AI practitioners than with the US-representative sample, emphasizing fairness, privacy, transparency, safety, and accountability. However, inconsistencies appear between the values that LLMs claim to uphold (Tasks 1-3) and the way they prioritize requirements (Task 4), revealing gaps in faithfulness between stated and applied behavior. These findings highlight the practical risk of relying on LLMs in requirements engineering without human oversight and motivate the need for systematic approaches to benchmark, interpret, and monitor value alignment in AI-assisted software development.

Paper Structure

This paper contains 10 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Results of selecting the five most important values from 12 values. LLMs align more closely with AI practitioners than with the US-representative sample. $N_{\text{per LLM}} = 50$, $N_{\text{US-representative sample}} = 516$, $N_{\text{AI practitioners}} = 140$.
  • Figure 2: Percentage of LLM responses rating a responsible AI value as extremely important (dark color) or very important (light color) across the four investigated contexts. $N_{\text{per value, LLM}} = 200$, $N_{\text{US-representative sample and AI practitioners}} = 140$–$607$.
  • Figure 3: Percentage of LLM responses rating a responsible AI value as extremely important (dark color) or very important (light color) across the four investigated contexts. $N_{\text{per value, context}} = 400$. The dotted bars show the share of the US-representative sample rating a given value as extremely or very important.
  • Figure 4: Percentage of LLM responses prioritizing a certain value over another across the four contexts. Lightly shaded areas indicate a lower degree of preference. Undecided responses are omitted. $N_{\text{per value, LLM}} = 200$, $N_{\text{US-representative sample and AI practitioners}} = 140$–$607$.
  • Figure 5: Percentage of LLM responses prioritizing a certain value over another, by context. Lightly shaded areas indicate a lower degree of preference. Undecided responses are omitted. $N_{\text{per value, context}} = 400$. The dotted bars show the share of the US-representative sample strongly or somewhat prioritizing a given value.
  • ...and 2 more figures