High-Dimension Human Value Representation in Large Language Models

Samuel Cahyawijaya, Delong Chen, Yejin Bang, Leila Khalatbari, Bryan Wilie, Ziwei Ji, Etsuko Ishii, Pascale Fung

TL;DR

This work tackles the challenge of understanding how large language models encode human values by introducing UniVaR, a high-dimensional, language- and model-invariant value embedding. UniVaR is learned via a self-supervised, multi-view Siamese framework that uses value-eliciting QA data to capture value-relevant factors while suppressing confounds, formalized through an information-bottleneck objective $I(\vartheta_\mathrm{value}; Z) - H(Z)$ and optimized with an InfoNCE loss. The authors generate ~1M QA pairs from 87 core values across 15 LLMs and 25 languages, evaluating with k-NN and linear probing on four value corpora, and show that UniVaR outperforms semantic baselines by substantial margins while revealing coherent cultural clusters in a cross-language value map. The results provide a quantitative and visual basis for comparing LLMs’ value priors across cultures, supporting transparency and accountability in AI alignment efforts. The work also discusses limitations in coverage and translation artifacts, and offers to release code and models to enable broader evaluation and expansion of value taxonomies.
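
As an illustration of the InfoNCE loss named above, here is a minimal NumPy sketch. It assumes that row i of `z_a` and row i of `z_b` are embeddings of two views of the same value source, and that every other row in the batch serves as a negative. The function name `info_nce_loss`, the cosine-similarity scoring, and the temperature of 0.07 are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def info_nce_loss(z_a: np.ndarray, z_b: np.ndarray, temperature: float = 0.07) -> float:
    """InfoNCE over a batch: row i of z_a is the positive for row i of z_b.

    Assumption: cosine similarity and temperature=0.07 are illustrative
    choices, not values reported by the paper.
    """
    # L2-normalize so that dot products become cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)

    # Pairwise similarity matrix of shape (N, N), scaled by the temperature.
    logits = (z_a @ z_b.T) / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Positives sit on the diagonal; the loss is their mean negative log-probability.
    return float(-np.mean(np.diag(log_prob)))
```

Under these assumptions, a batch whose paired rows are near-identical yields a loss close to zero, while pairing each row with an unrelated row yields a noticeably larger loss.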

Abstract

The widespread application of LLMs across various tasks and fields has necessitated the alignment of these models with human values and preferences. Given the variety of approaches to human value alignment, there is an urgent need to understand the scope and nature of the human values injected into these LLMs before their deployment and adoption. We propose UniVaR, a high-dimensional neural representation of symbolic human value distributions in LLMs, orthogonal to model architecture and training data. This is a continuous and scalable representation, self-supervised from the value-relevant output of 8 LLMs and evaluated on 15 open-source and commercial LLMs. Through UniVaR, we visualize and explore how LLMs prioritize different values in 25 languages and cultures, shedding light on the complex interplay between human values and language modeling.

Paper Structure (47 sections, 3 equations, 9 figures, 7 tables)

This paper contains 47 sections, 3 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: UMAP Visualization of our UniVaR value embeddings. Each dot represents a pair of a value-eliciting question and the answer from a specific LLM in a certain language (15 LLMs and 25 languages in total). The distribution reflects distances and similarities between different cultures in terms of human values.
  • Figure 2: Overview of UniVaR. Left: our objective is to learn a value embedding $Z$ that represents the value-relevant factor $\vartheta_\mathrm{value}$ of an LLM. Middle: we elicit LLM values through QA, such that $\vartheta_\mathrm{value}$ is expressed by the distribution of its value-eliciting QA set $X$. Right: we apply multi-view learning to eliminate irrelevant information while preserving value-relevant aspects.
  • Figure 3: Value-eliciting QA generation pipeline for training. A total of 4296 English value-eliciting questions are synthesized from a set of 87 human values for training UniVaR and the diversity is enhanced through paraphrasing each question. Each question is translated into multiple languages and fed into LLMs to get the value-eliciting answers in those languages. All QA pairs are then translated back into English to minimize the linguistic variation across QAs. At the end, we obtain $\sim$1M QA pairs for training.
  • Figure 4: Performance comparison of UniVaR between value-eliciting QAs and non-value-eliciting QAs from LIMA (Zhou et al., 2023). The influence of non-value-related confounders in UniVaR is minimal compared to baselines, as signified by the substantial performance gap between the two tasks.
  • Figure 5: (left) Grouped map of UniVaR value representation. (right) 2023 version of the Inglehart–Welzel Cultural Map. The UniVaR value representations demonstrate relations between LLM values and human cultures: similar cultures tend to cluster together within the same region, while unrelated cultures tend to be disjoint and located far apart from one another, forming regional values.
  • ...and 4 more figures
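
The QA generation pipeline described in the Figure 3 caption (translate each question, query each LLM, translate back into English) can be sketched as follows. This is a hedged illustration only: `QAPair`, `build_qa_pairs`, and the `translate` and `ask` callables are hypothetical names introduced here, and back-translating the question as well as the answer is one reading of "all QA pairs are then translated back into English". The question synthesis and paraphrasing steps are assumed to have already produced `questions_en`.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class QAPair:
    model: str        # which LLM produced the answer
    language: str     # language in which the LLM was queried
    question_en: str  # question after back-translation into English
    answer_en: str    # answer after back-translation into English


def build_qa_pairs(
    questions_en: Sequence[str],
    languages: Sequence[str],
    models: Sequence[str],
    translate: Callable[[str, str, str], str],  # hypothetical: (text, src, tgt) -> text
    ask: Callable[[str, str], str],             # hypothetical: (model, question) -> answer
) -> List[QAPair]:
    """Collect one English QA pair per (question, language, model) combination."""
    pairs: List[QAPair] = []
    for q_en in questions_en:
        for lang in languages:
            # Pose the question in the target language.
            q_local = translate(q_en, "en", lang)
            for model in models:
                a_local = ask(model, q_local)
                # Translate both sides back into English to reduce linguistic variation.
                pairs.append(
                    QAPair(
                        model=model,
                        language=lang,
                        question_en=translate(q_local, lang, "en"),
                        answer_en=translate(a_local, lang, "en"),
                    )
                )
    return pairs
```

Passing the translator and the LLM client in as callables keeps the sketch independent of any particular translation service or model API, which the caption does not specify.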