Following the Whispers of Values: Unraveling Neural Mechanisms Behind Value-Oriented Behaviors in LLMs

Ling Hu, Yuemei Xu, Xiaoyang Gu, Letao Han

TL;DR

This paper tackles the challenge of interpreting how non-Western social values are encoded and expressed in large language models. It introduces ValueExploration, a three-step framework that builds a bilingual value benchmark (C-voice), identifies value-specific neurons through an entropy-based analysis of activations in FFN layers, and tests their causal influence by deactivating them and observing the resulting behavioral shifts. Across four LLMs, the study finds that value-specific neurons exist, that the language of the prompt modulates value alignment, and that deactivating these neurons produces measurable changes in value-driven decisions. By linking neuron activity causally to value-oriented behavior, the work contributes to interpretability and alignment research; the authors state that the benchmark and code will be released to support further cross-cultural analysis of LLM values.

Abstract

Despite the impressive performance of large language models (LLMs), they can present unintended biases and harmful behaviors driven by encoded values, emphasizing the urgent need to understand the value mechanisms behind them. However, current research primarily evaluates these values through external responses with a focus on AI safety, lacking interpretability and failing to assess social values in real-world contexts. In this paper, we propose a novel framework called ValueExploration, which aims to explore the behavior-driven mechanisms of National Social Values within LLMs at the neuron level. As a case study, we focus on Chinese Social Values and first construct C-voice, a large-scale bilingual benchmark for identifying and evaluating Chinese Social Values in LLMs. By leveraging C-voice, we then identify and locate the neurons responsible for encoding these values according to activation difference. Finally, by deactivating these neurons, we analyze shifts in model behavior, uncovering the internal mechanism by which values influence LLM decision-making. Extensive experiments on four representative LLMs validate the efficacy of our framework. The benchmark and code will be available.
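
The identification and deactivation steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract only says that neurons are located "according to activation difference" and the TL;DR mentions an entropy-based analysis, so the scoring rule below (low Shannon entropy over per-value activation probabilities, plus a minimum activation probability) is an assumption. The function names, the thresholds `max_entropy` and `min_prob`, and the toy numbers are all hypothetical.

```python
import math


def activation_entropy(probs):
    """Shannon entropy of one neuron's activation profile across values.

    probs[v] is the (assumed) probability that the neuron is active on
    prompts for value v. The profile is normalized to a distribution first.
    """
    total = sum(probs)
    if total == 0:
        return float("inf")  # never active, so not specific to any value
    dist = [p / total for p in probs]
    return -sum(q * math.log(q) for q in dist if q > 0)


def select_value_neurons(act_probs, max_entropy, min_prob):
    """Map each value index to the neurons that look specific to it.

    act_probs[n][v] is the activation probability of neuron n on value v.
    A neuron is kept when its entropy is low (activity concentrated on few
    values) and it is reliably active on the value where it peaks.
    Both thresholds are hypothetical tuning knobs.
    """
    selected = {}
    for n, probs in enumerate(act_probs):
        if activation_entropy(probs) > max_entropy:
            continue
        peak = max(range(len(probs)), key=probs.__getitem__)
        if probs[peak] >= min_prob:
            selected.setdefault(peak, []).append(n)
    return selected


def deactivate(ffn_activations, neuron_ids):
    """Return a copy of one FFN layer's activations with chosen neurons zeroed."""
    blocked = set(neuron_ids)
    return [0.0 if i in blocked else a for i, a in enumerate(ffn_activations)]


# Toy profile (made-up numbers): 3 neurons, 2 values.
act_probs = [
    [0.90, 0.02],  # active almost only on value 0
    [0.50, 0.50],  # equally active on both values: not specific
    [0.01, 0.80],  # active almost only on value 1
]
neurons = select_value_neurons(act_probs, max_entropy=0.3, min_prob=0.5)
masked = deactivate([0.3, 1.2, 0.7], neurons[0])
```

In a real model the zeroing would be applied to the FFN activations during the forward pass (for example through a hook on the relevant layer) rather than to a plain list; the list version is used here only to keep the sketch self-contained.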

Paper Structure

This paper contains 28 sections, 4 equations, 13 figures, 1 table.

Figures (13)

  • Figure 1: The pipeline of the ValueExploration framework. It follows three steps: 1) construct a value benchmark; 2) identify value-specific neurons using the constructed activation benchmark; 3) evaluate LLMs’ alignment with given national social values and the influence of identified neurons.
  • Figure 2: Definitions of Chinese Social Values.
  • Figure 3: An example of generated data in Dedication.
  • Figure 4: Support rate comparison of four models across 12 Chinese Social Values, evaluated on English (left) and Chinese (right).
  • Figure 5: Impact of value-specific neurons on value support rate for four LLMs. The element at the i-th row and j-th column represents the support rate change for value i due to deactivation of the neuron for value j. Blue denotes the English test set, and red denotes the Chinese test set. We highlight significant decreases in the top four values, with deeper diagonal shades indicating a significant effect of value-specific neurons on the corresponding value.
  • ...and 8 more figures
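
The matrix described in the Figure 5 caption (row i, column j holds the change in support rate for value i after deactivating the neurons for value j) can be sketched as follows. This is only an illustration of the indexing convention: the function name, the input layout, and every number are hypothetical and are not results from the paper.

```python
def support_rate_change(baseline, ablated):
    """Build the matrix of support-rate changes.

    baseline[i]   : support rate for value i with the model intact.
    ablated[j][i] : support rate for value i after deactivating the
                    neurons identified for value j.
    Returns delta with delta[i][j] = ablated[j][i] - baseline[i], so a
    strongly negative diagonal means each value's own neurons matter most.
    """
    n = len(baseline)
    return [[ablated[j][i] - baseline[i] for j in range(n)] for i in range(n)]


# Made-up support rates for two values.
baseline = [0.80, 0.70]
ablated = [
    [0.50, 0.68],  # neurons for value 0 deactivated
    [0.79, 0.40],  # neurons for value 1 deactivated
]
delta = support_rate_change(baseline, ablated)
```

With these toy inputs the diagonal entries are the largest drops, which is the pattern the caption describes as deeper diagonal shades.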