VAL-Bench: Measuring Value Alignment in Language Models

Aman Gupta, Denny O'Shea, Fazl Barez

TL;DR

VAL-Bench introduces a large-scale, automated benchmark to test whether LLMs maintain stable value stances across opposing framings of controversial issues. By mining 115K paired prompts from Wikipedia and scoring model responses with an LLM-based judge via the PAC metric, the study reveals substantial variation across models and a clear safety-expressivity trade-off. The work documents how refusals can mask values while more expressive models may exhibit inconsistency, and it provides a pathway for systematic, reproducible measurement of value alignment. It also discusses training implications, evaluation calibration, and potential avenues for improving belief-consistent alignment in future LLM development.

Abstract

Large language models (LLMs) are increasingly used for tasks where outputs shape human decisions, so it is critical to test whether their responses reflect consistent human values. Existing benchmarks mostly track refusals or predefined safety violations, but these only check rule compliance and do not reveal whether a model upholds a coherent value system when facing controversial real-world issues. We introduce the Value ALignment Benchmark (VAL-Bench), which evaluates whether models maintain a stable value stance across paired prompts that frame opposing sides of public debates. VAL-Bench consists of 115K such pairs from Wikipedia's controversial sections. A well-aligned model should express similar underlying views regardless of framing, which we measure using an LLM-as-judge to score agreement or divergence between paired responses. Applied across leading open- and closed-source models, the benchmark reveals large variation in alignment and highlights trade-offs between safety strategies (e.g., refusals) and more expressive value systems. By providing a scalable, reproducible benchmark, VAL-Bench enables systematic comparison of how reliably LLMs embody human values.
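The paired-prompt evaluation described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `PromptPair` fields, the `respond` and `judge` callables, and the score scale (1.0 for agreement, 0.0 for divergence) are all assumptions made for the example, and the plain mean used here is not claimed to be the paper's PAC metric, whose formula is not given in this summary.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass(frozen=True)
class PromptPair:
    """Hypothetical container for two prompts framing opposing sides of one issue."""
    issue: str
    prompt_for: str
    prompt_against: str


def pairwise_consistency(
    pairs: Sequence[PromptPair],
    respond: Callable[[str], str],       # model under test: prompt -> response
    judge: Callable[[str, str], float],  # assumed scale: 1.0 = agree, 0.0 = diverge
) -> float:
    """Average the judge's agreement score over all prompt pairs."""
    if not pairs:
        raise ValueError("at least one prompt pair is required")
    scores = [
        judge(respond(p.prompt_for), respond(p.prompt_against))
        for p in pairs
    ]
    return mean(scores)


# Toy stand-ins so the sketch runs without any model or API access.
def toy_judge(response_a: str, response_b: str) -> float:
    """Placeholder judge: identical stances agree, anything else diverges."""
    return 1.0 if response_a == response_b else 0.0


def steadfast_model(prompt: str) -> str:
    """Expresses the same stance regardless of how the prompt is framed."""
    return "stance: A"


def yielding_model(prompt: str) -> str:
    """Adopts whichever side the prompt asks it to support."""
    return "stance: A" if "support" in prompt else "stance: B"


toy_pairs = [
    PromptPair("issue-1", "Explain why to support X.", "Explain why to oppose X."),
    PromptPair("issue-2", "Explain why to support Y.", "Explain why to oppose Y."),
]
```

Under these toy definitions, a model that keeps its stance across both framings receives the highest score and a model that yields to each framing receives the lowest. In a real setup the `judge` would be an LLM call, and refusals would need their own handling.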

Paper Structure

This paper contains 32 sections, 2 equations, 3 figures, and 11 tables.

Figures (3)

  • Figure 1: Illustration of how VAL-Bench assesses alignment. When asked to support opposing views, misaligned AI systems often yield, whereas aligned systems maintain consistency. Both response pairs shown are generated by recent, popular LLMs.
  • Figure 2: Values demonstrated by LLM responses compared to expected values from the prompts. Bars represent Pearson residuals; positive residuals indicate over-representation and negative residuals indicate under-representation. Numbers in parentheses ($\cdot$) show the actual counts of responses demonstrating each value. $VDR$ is the Value Demonstration Rate, which measures value expressivity. $\tilde{\chi}^2$ is the reduced $\chi^2$ statistic, which indicates value alignment. These metrics illustrate the alignment-expressivity trade-off emerging from direct value analysis.
  • Figure 3: Metrics vs. issue awareness. All three primary metrics show clear correlations with issue awareness.
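
The statistics named in the Figure 2 caption can be illustrated with a short sketch using the standard textbook definitions: a Pearson residual is (observed − expected) / √expected, and a reduced χ² divides the sum of squared residuals by the degrees of freedom. The function names, the example counts, and the default of (number of categories − 1) degrees of freedom are assumptions made for illustration; the paper's exact computation may differ.

```python
import math
from typing import Optional, Sequence


def pearson_residuals(
    observed: Sequence[float], expected: Sequence[float]
) -> list[float]:
    """Per-category Pearson residuals: (O - E) / sqrt(E).

    Positive values mean a category is over-represented relative to
    expectation; negative values mean it is under-represented.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected counts must be positive")
    return [(o - e) / math.sqrt(e) for o, e in zip(observed, expected)]


def reduced_chi_square(
    observed: Sequence[float],
    expected: Sequence[float],
    dof: Optional[int] = None,
) -> float:
    """Chi-square statistic divided by the degrees of freedom.

    Assumption: dof defaults to (number of categories - 1).
    """
    residuals = pearson_residuals(observed, expected)
    chi_square = sum(r * r for r in residuals)
    if dof is None:
        dof = len(residuals) - 1
    if dof <= 0:
        raise ValueError("degrees of freedom must be positive")
    return chi_square / dof


# Made-up counts for three hypothetical value categories.
observed_counts = [30, 10, 20]
expected_counts = [20, 20, 20]
residuals = pearson_residuals(observed_counts, expected_counts)
statistic = reduced_chi_square(observed_counts, expected_counts)
```

With these made-up counts, the first category is over-represented (positive residual), the second is under-represented (negative residual), and the third matches expectation exactly. A reduced χ² near zero would mean the demonstrated values closely track the expected ones.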