Exploring Multilingual Concepts of Human Value in Large Language Models: Is Value Alignment Consistent, Transferable and Controllable across Languages?

Shaoyang Xu; Weilong Dong; Zishan Guo; Xinwei Wu; Deyi Xiong

TL;DR

This paper empirically confirms the presence of value concepts within LLMs in a multilingual format and validates the feasibility of cross-lingual control over value alignment capabilities of LLMs, leveraging the dominant language as a source language.

Abstract

Prior research has revealed that certain abstract concepts are linearly represented as directions in the representation space of LLMs, predominantly centered around English. In this paper, we extend this investigation to a multilingual context, with a specific focus on human values-related concepts (i.e., value concepts) due to their significance for AI safety. Through our comprehensive exploration covering 7 types of human values, 16 languages and 3 LLM series with distinct multilinguality (e.g., monolingual, bilingual and multilingual), we first empirically confirm the presence of value concepts within LLMs in a multilingual format. Further analysis on the cross-lingual characteristics of these concepts reveals 3 traits arising from language resource disparities: cross-lingual inconsistency, distorted linguistic relationships, and unidirectional cross-lingual transfer between high- and low-resource languages, all in terms of value concepts. Moreover, we validate the feasibility of cross-lingual control over value alignment capabilities of LLMs, leveraging the dominant language as a source language. Ultimately, recognizing the significant impact of LLMs' multilinguality on our results, we consolidate our findings and provide prudent suggestions on the composition of multilingual data for LLMs pre-training.
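
To make the notion of a linearly represented concept concrete, here is a minimal sketch on synthetic data. It is not the paper's pipeline: the difference-of-means estimator, the projection-based recognizer, the toy 8-dimensional "hidden states", and the language labels `en`, `fr` and `xx` are all assumptions made for illustration. In practice the vectors would be hidden states read from an LLM layer.

```python
import random

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / ((dot(u, u) ** 0.5) * (dot(v, v) ** 0.5))

def concept_vector(positive, negative):
    """Difference of class means (an assumed, common estimator of a concept direction).
    Also returns the projection threshold halfway between the two class means."""
    mu_pos, mu_neg = mean_vector(positive), mean_vector(negative)
    vec = [p - q for p, q in zip(mu_pos, mu_neg)]
    midpoint = dot(vec, [(p + q) / 2 for p, q in zip(mu_pos, mu_neg)])
    return vec, midpoint

def accuracy(vec, midpoint, positive, negative):
    """Fraction of states whose projection falls on the correct side of the threshold."""
    hits = sum(dot(h, vec) > midpoint for h in positive)
    hits += sum(dot(h, vec) <= midpoint for h in negative)
    return hits / (len(positive) + len(negative))

def synthetic_states(direction, sign, n, rng, noise=0.3):
    """Toy stand-in for hidden states: +/- direction plus Gaussian noise."""
    return [[sign * d + rng.gauss(0.0, noise) for d in direction] for _ in range(n)]

rng = random.Random(0)
shared = [1.0] + [0.0] * 7         # direction used by both "en" and "fr"
other = [0.0, 1.0] + [0.0] * 6     # unrelated direction used by hypothetical "xx"

data = {}
for lang, direction in [("en", shared), ("fr", shared), ("xx", other)]:
    data[lang] = (synthetic_states(direction, +1, 200, rng),
                  synthetic_states(direction, -1, 200, rng))

vectors = {lang: concept_vector(pos, neg) for lang, (pos, neg) in data.items()}

# Cross-lingual similarity of concept vectors.
sim_en_fr = cosine(vectors["en"][0], vectors["fr"][0])
sim_en_xx = cosine(vectors["en"][0], vectors["xx"][0])

# Cross-lingual transfer: apply the "en" vector to other languages' states.
en_vec, en_mid = vectors["en"]
acc_en_on_fr = accuracy(en_vec, en_mid, *data["fr"])
acc_en_on_xx = accuracy(en_vec, en_mid, *data["xx"])

print(f"cos(en, fr)={sim_en_fr:.3f}  cos(en, xx)={sim_en_xx:.3f}")
print(f"en->fr accuracy={acc_en_on_fr:.3f}  en->xx accuracy={acc_en_on_xx:.3f}")
```

In this toy setup, languages that share a direction yield nearly parallel concept vectors and the vector transfers between them, while the unrelated language shows low similarity and near-chance transfer. The setup only illustrates the quantities involved (recognition accuracy, cross-lingual similarity, transferability); it says nothing about what real models do.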

Paper Structure (61 sections, 2 equations, 7 figures, 9 tables)

Introduction
Related Work
Representation Engineering
Multilinguality of LLMs
Multilingual AI Safety
Exploring Multilingual Value Concepts
Collecting Multilingual Concept Vectors
Recognizing Multilingual Concepts
Calculating Cross-Lingual Similarity of Concept Vectors
Recognizing Cross-Lingual Concepts
Experiments
Experimental Setup
Human Value Datasets
Examined Languages and LLMs
Q1: Do LLMs Encode Concepts Representing Human Values in Multiple Languages?
...and 46 more sections

Figures (7)

Figure 1: Multilingual concept recognition accuracy (%) of the LLaMA2-chat, Qwen-chat and BLOOMZ series, averaged across all value concepts. The results of the three 7B-sized models are connected with dashed lines for comparison. "Represented languages" refers to the languages present in the pre-training corpus.
Figure 2: (a) Multilingual concept recognition accuracy across different model layers. (b) Cross-lingual similarity of concept vectors across different model layers. Results are averaged across the languages included in the pre-training data of both the LLaMA2-chat and BLOOMZ series, as well as across all human values.
Figure 3: Cross-lingual similarity of concept vectors across all language pairs, averaged over all value concepts. The languages included in each model's pre-training data are presented and sorted by their proportions in that model's pre-training data. For the Qwen-chat series, we conjecture its language inclusion based on multilingual concept recognition accuracy (see the section on multilingual concept recognition) and display its primary languages, zh and en, at the forefront.
Figure 4: Cross-lingual concept transferability across all language pairs, averaged over all value concepts. Languages are sorted based on their percentages in the pre-training data.
Figure 5: English concept recognition accuracy with varying numbers of training samples for collecting concept vectors. The results are based on LLaMA2-chat-13B. We calculate the average accuracy across all layers to ensure that the results of different settings are comparable.
...and 2 more figures
