Following the Whispers of Values: Unraveling Neural Mechanisms Behind Value-Oriented Behaviors in LLMs
Ling Hu, Yuemei Xu, Xiaoyang Gu, Letao Han
TL;DR
This paper tackles the challenge of interpreting how non-Western social values are encoded and manifested in large language models. It introduces ValueExploration, a three-step framework that builds a bilingual value benchmark (C-voice), identifies value-specific neurons using an entropy-based activation analysis in FFN layers, and assesses causal influence by deactivating these neurons to observe behavioral shifts. Across four LLMs, the study demonstrates that value-specific neurons exist and that language context modulates value alignment, with perturbations producing measurable changes in value-driven decisions. By revealing a causal link between neuron activity and value-oriented behavior, the work advances interpretability and alignment research, and provides a publicly available benchmark and code to support further cross-cultural analysis of LLM values.
Abstract
Despite the impressive performance of large language models (LLMs), they can exhibit unintended biases and harmful behaviors driven by encoded values, emphasizing the urgent need to understand the value mechanisms behind them. However, current research primarily evaluates these values through external responses with a focus on AI safety, lacking interpretability and failing to assess social values in real-world contexts. In this paper, we propose a novel framework called ValueExploration, which aims to explore the behavior-driven mechanisms of National Social Values within LLMs at the neuron level. As a case study, we focus on Chinese Social Values and first construct C-voice, a large-scale bilingual benchmark for identifying and evaluating Chinese Social Values in LLMs. By leveraging C-voice, we then identify and locate the neurons responsible for encoding these values based on differences in their activations. Finally, by deactivating these neurons, we analyze shifts in model behavior, uncovering the internal mechanism by which values influence LLM decision-making. Extensive experiments on four representative LLMs validate the efficacy of our framework. The benchmark and code will be made available.
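
The identify-then-deactivate idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the toy feed-forward block, the synthetic activations, the plain mean-difference ranking, and the names `select_value_neurons` and `ffn_forward` are all assumptions made for illustration. The paper's actual selection criterion may differ.

```python
import numpy as np


def select_value_neurons(value_acts, neutral_acts, top_k):
    """Rank neurons by mean activation difference and return the top_k indices.

    Both inputs have shape (n_prompts, n_neurons). A plain mean difference
    is an illustrative assumption, not the paper's exact criterion.
    """
    diff = value_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    # argsort is ascending, so take the last top_k and reverse them.
    return np.argsort(diff)[-top_k:][::-1]


def ffn_forward(x, w_in, w_out, deactivated=None):
    """Toy two-layer FFN with ReLU; optionally zero the listed hidden neurons."""
    hidden = np.maximum(x @ w_in, 0.0)
    if deactivated is not None:
        hidden = hidden.copy()
        hidden[:, deactivated] = 0.0  # "deactivate" the selected neurons
    return hidden @ w_out


rng = np.random.default_rng(0)
d_model, n_neurons = 8, 16

# Synthetic activations: neurons 3 and 7 respond strongly to value prompts.
neutral_acts = rng.random((32, n_neurons))
value_acts = rng.random((32, n_neurons))
value_acts[:, [3, 7]] += 2.0

selected = select_value_neurons(value_acts, neutral_acts, top_k=2)

# Hypothetical FFN weights and inputs for the toy block.
x = rng.random((4, d_model))
w_in = rng.random((d_model, n_neurons))
w_out = rng.normal(size=(n_neurons, d_model))

base = ffn_forward(x, w_in, w_out)
ablated = ffn_forward(x, w_in, w_out, deactivated=selected)

# Mean absolute change in the block's output caused by the deactivation.
shift = np.abs(base - ablated).mean()
print("selected neurons:", sorted(selected.tolist()))
print("mean output shift:", shift)
```

In a real LLM, the same masking would be applied to the hidden activations of the model's FFN layers, and the behavioral shift would be measured on benchmark responses rather than on raw outputs of a toy block.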
