ValueDCG: Measuring Comprehensive Human Value Understanding Ability of Language Models

Zhaowei Zhang; Fengshuo Bai; Jun Gao; Yaodong Yang

Abstract

Personal values are a crucial factor behind human decision-making. Because Large Language Models (LLMs) have been shown to influence human decisions significantly, ensuring their safety requires that they accurately understand human values. However, evaluating their grasp of these values is difficult because values are intricate and adaptable. We argue that truly understanding values requires an LLM to both "know what" and "know why". To this end, we present a comprehensive evaluation metric, ValueDCG (Value Discriminator-Critique Gap), together with an engineering implementation, to assess the two aspects quantitatively. We assess four representative LLMs and provide compelling evidence that the growth rates of LLMs' "know what" and "know why" capabilities do not keep pace with increases in parameter count, so the models' capacity to understand human values declines as the number of parameters grows. This further suggests that LLMs might craft plausible explanations from the provided context without truly understanding the underlying values, which indicates potential risks.

Paper Structure (18 sections, 4 equations, 3 figures, 4 tables)

Introduction
Motivation: Some Brief Examples
Related Work
Method
Discriminator-Critique Gap
Definition and Quantification Methods
Overall Framework
Experiments
Experiment Settings
Consistency of GPT Evaluation
Evaluation for the Understanding of Values
Conclusions
Details for Values Definition
Human Data Collection
Annotator Details
...and 3 more sections

Figures (3)

Figure 1: A simple example illustrating how differences in LLMs' understanding of values affect social-good decisions. A government official asks the LLM for advice on renovating a library, which requires considering equal control over the public area by all local residents, that is, a comprehensive understanding of the value of "Power". The first LLM is capable and fully understands human values, so it is both helpful and harmless. The second model has some ability but does not understand human values. Its answer seems reasonable at first glance, but closer inspection reveals problems: it is neither helpful nor harmless and could lead to serious social dissatisfaction. The third model is less capable but fully understands human values. It knows what needs to be done but cannot provide it, which reflects harmlessness. We believe that the first and third models both understand values well enough to satisfy the harmlessness requirement for LLMs.
Figure 2: Overview of our engineering implementation framework for measuring ValueDCG, which should be read from bottom to top. The framework quantifies both "know what" and "know why" and computes ValueDCG from their discrepancy. For the former, we measure the correctness of the LLM-generated label $\hat{x}^l$ against the ground truth $x^l$. For the latter, we have the LLM output analyses of three aspects: Attribution Analysis, Counterfactual Analysis, and Rebuttal Argument, denoted $Res_a$, $Res_c$, and $Res_r$. We then construct a GPT evaluator that maps these three responses to scalar values from 1 to 5, denoted $V_a$, $V_c$, and $V_r$. We compute their average $V_{avg}$ and normalize it to obtain the quantification metric. The ValueDCG value $\mathcal{Q}_{\text{vdcg}}$ for the tested LLM $m$ is the absolute difference between the discriminator and critique scores.
Figure 3: Row-normalized confusion matrices of "know why" scoring. Each subfigure contains 200 evaluation data points. Rows represent the annotation distribution of GPT-4o, and columns represent the annotation distribution of 10 human annotators. Darker colors indicate a higher frequency of overlapping annotations. The subfigures show the consistency results for attribution analysis, counterfactual analysis, and rebuttal argument. The results indicate that, although GPT-4o tends to over-annotate to some extent, it generally aligns with human annotations across the three dimensions.
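
The computation described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names are hypothetical, the discriminator score is assumed to be plain label accuracy, and the normalization of $V_{avg}$ is assumed to be a linear map from the 1-5 scale to [0, 1]; the caption does not specify either detail.

```python
from statistics import mean


def discriminator_score(predicted_labels, true_labels):
    """'Know what': fraction of LLM labels matching the ground truth (assumed: accuracy)."""
    if not true_labels or len(predicted_labels) != len(true_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    return mean(1.0 if p == t else 0.0 for p, t in zip(predicted_labels, true_labels))


def critique_score(ratings):
    """'Know why': ratings is a list of (V_a, V_c, V_r) tuples, each value in 1..5.

    Each sample's three ratings are averaged, the result is averaged over
    samples, and then mapped linearly from [1, 5] to [0, 1]. The linear
    map is an assumption of this sketch.
    """
    if not ratings:
        raise ValueError("ratings must be non-empty")
    v_avg = mean(mean(triple) for triple in ratings)
    return (v_avg - 1.0) / 4.0


def value_dcg(predicted_labels, true_labels, ratings):
    """Absolute gap between the discriminator and critique scores."""
    return abs(discriminator_score(predicted_labels, true_labels) - critique_score(ratings))


if __name__ == "__main__":
    # Toy data, purely illustrative: 3 of 4 labels correct, mixed critique ratings.
    predicted = [1, 0, 1, 1]
    truth = [1, 1, 1, 1]
    ratings = [(5, 4, 3), (2, 3, 4), (4, 4, 4), (3, 3, 3)]
    print(value_dcg(predicted, truth, ratings))
```

Under these assumptions, a model whose labels are mostly correct but whose explanations score poorly yields a large gap, which is the discrepancy the metric is designed to expose.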
