ValueBench: Towards Comprehensively Evaluating Value Orientations and Understanding of Large Language Models

Yuanyi Ren; Haoran Ye; Hanjun Fang; Xin Zhang; Guojie Song

TL;DR

ValueBench introduces the first comprehensive psychometric benchmark for evaluating value orientations and value understanding in large language models. It aggregates 453 value dimensions from 44 psychometric inventories and provides an evaluation pipeline grounded in authentic human-AI interactions, plus novel open-ended tasks that probe value understanding within a hierarchical value space. Across six diverse LLMs, ValueBench reveals both shared and model-specific value orientations, and demonstrates that LLMs can approximate expert conclusions in value-related extraction and generation tasks when properly prompted. The work contributes a unified data resource, structured evaluation procedures, and analyses that advance value alignment research and interdisciplinary study at the intersection of AI and psychology.

Abstract

Large Language Models (LLMs) are transforming diverse fields and gaining increasing influence as human proxies. This development underscores the urgent need for evaluating value orientations and understanding of LLMs to ensure their responsible integration into public-facing applications. This work introduces ValueBench, the first comprehensive psychometric benchmark for evaluating value orientations and value understanding in LLMs. ValueBench collects data from 44 established psychometric inventories, encompassing 453 multifaceted value dimensions. We propose an evaluation pipeline grounded in realistic human-AI interactions to probe value orientations, along with novel tasks for evaluating value understanding in an open-ended value space. With extensive experiments conducted on six representative LLMs, we unveil their shared and distinctive value orientations and exhibit their ability to approximate expert conclusions in value-related extraction and generation tasks. ValueBench is openly accessible at https://github.com/Value4AI/ValueBench.

Paper Structure (40 sections, 11 figures, 3 tables)

Introduction
Related Work
Value Theory.
Psychometric Evaluations of LLMs.
Value Understanding in LLMs.
ValueBench
The Structure of Human Values
ValueBench Dataset Construction
Item-Value Pair Extraction.
Value Interpretation Extraction.
Value Substructure Extraction.
Evaluations with ValueBench
Evaluating Value Orientations of LLMs
Evaluation Pipeline
Evaluation Results
...and 25 more sections

Figures (11)

Figure 1: Overview of ValueBench dataset construction. We collect psychometric inventories from domains including personality, social axioms, cognitive system, and general value theory. From these inventories, value definitions, value-item pairs, and value hierarchies are extracted and collected.
Figure 2: The evaluation pipeline for LLM value orientations, exemplified using an item drawn from the Consciousness of Social Face Inventory. Each item is rephrased into a closed question and administered to LLMs for free-form responses. Each response is evaluated based on the extent to which it leans towards a "Yes", indirectly revealing the value orientation of an LLM.
Figure 3: Evaluation results of LLM value orientations. We illustrate the results of 12 representative inventories and defer the complete results to the appendix (extended results).
Figure 4: An example of inconsistency between LLM responses in controlled settings (a rating of agreement with a statement) and in authentic human-AI interactions (responses to value-related user questions).
Figure 5: The evaluation pipeline of value understanding consists of three main tasks. First, we collect positive and negative samples of relevant value pairs from ValueBench and test LLMs' abilities to identify these relationships. Next, we conduct two generation tasks, namely item-to-value extraction and value-to-item generation, to evaluate the LLMs' performance in generating value-related content.
...and 6 more figures
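
The orientation pipeline summarized in the Figure 2 caption (closed question, free-form response, rating of how far the response leans towards "Yes") can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the function names, the 0 to 10 rating scale, the reversal of negatively keyed items, and the stub model and rater are all assumptions introduced here.

```python
from statistics import mean
from typing import Callable

SCALE_MAX = 10.0  # assumed rating range 0..10; not taken from the source


def value_orientation_score(
    items: list[tuple[str, bool]],
    ask_model: Callable[[str], str],
    rate_yes_leaning: Callable[[str, str], float],
) -> float:
    """Average "Yes"-leaning over items for one value dimension.

    Each item is (closed_question, agrees_with_value). When a "Yes" answer
    opposes the value, the rating is reversed (an assumed scoring rule).
    """
    scores = []
    for question, agrees_with_value in items:
        response = ask_model(question)                  # free-form answer
        rating = rate_yes_leaning(question, response)   # 0 = "No", 10 = "Yes"
        scores.append(rating if agrees_with_value else SCALE_MAX - rating)
    return mean(scores)


# Hypothetical stand-ins so the sketch runs without any external service.
def stub_model(question: str) -> str:
    if "ignore" in question:
        return "No, other people's views are worth considering."
    return "Yes, it is reasonable to care about that."


def stub_rater(question: str, response: str) -> float:
    return 8.0 if response.startswith("Yes") else 2.0


example_items = [
    ("Should I care about how others see me?", True),
    ("Should I ignore what others think of me?", False),
]
example_score = value_orientation_score(example_items, stub_model, stub_rater)
```

In practice the two stubs would be replaced by a call to the model under test and by whatever rater judges the response; keeping both as plain callables makes that substitution straightforward.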
