Understanding How Value Neurons Shape the Generation of Specified Values in LLMs

Yi Su, Jiayi Zhang, Shu Yang, Xinhai Wang, Lijie Hu, Di Wang

TL;DR

We address the opacity of internal value representations in LLMs by introducing ValueLocate, a mechanistic interpretability framework anchored to the Schwartz Value Survey. Its dataset, ValueInsight, comprises 640 value descriptions and 15,000 situational prompts spanning the four higher-order dimensions: Openness to Change, Self-Transcendence, Conservation, and Self-Enhancement. Value-related neurons are identified via activation contrasts between opposing value aspects. Causality is validated by editing neuron activations with a dynamic scaling factor gamma to steer value orientations, demonstrating robust control over model values. Across four LM families, the results show that value-related neurons are sparse yet sufficient to induce consistent changes, laying principled groundwork for value alignment that couples psychological theory with neural mechanisms.

Abstract

The rapid integration of large language models (LLMs) into societal applications has intensified concerns about their alignment with universal ethical principles, as their internal value representations remain opaque despite advances in behavioral alignment. Current approaches struggle to systematically interpret how values are encoded in neural architectures, limited by datasets that prioritize superficial judgments over mechanistic analysis. We introduce ValueLocate, a mechanistic interpretability framework grounded in the Schwartz Value Survey, to address this gap. Our method first constructs ValueInsight, a dataset that operationalizes four dimensions of universal values through real-world behavioral contexts. Leveraging this dataset, we develop a neuron identification method that calculates activation differences between opposing value aspects, enabling precise localization of value-critical neurons without relying on computationally intensive attribution methods. Our proposed validation method demonstrates that targeted manipulation of these neurons effectively alters model value orientations, establishing causal relationships between neurons and value representations. This work advances the foundation for value alignment by bridging psychological value frameworks with neuron analysis in LLMs.
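
The two steps the abstract describes, locating neurons by activation contrast and then editing them with a scaling factor, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation. The function names, the top-k selection rule, and the additive edit (shifting each selected neuron by gamma times its contrast) are assumptions made for illustration; the page does not give the actual scoring formula or how the dynamic gamma is computed, so gamma is a fixed constant here.

```python
import numpy as np

def select_value_neurons(acts_pos, acts_neg, top_k):
    """Rank neurons by the contrast between two opposing prompt sets.

    acts_pos, acts_neg: arrays of shape (num_prompts, num_neurons) holding
    activations recorded on prompts for each of the two opposing aspects.
    The top-k rule is an assumption, not the paper's criterion.
    """
    # Mean activation difference per neuron between the two aspects.
    diff = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
    # Keep the neurons with the largest absolute contrast.
    idx = np.argsort(-np.abs(diff))[:top_k]
    return idx, diff

def edit_activations(acts, idx, diff, gamma):
    """Shift the selected neurons along their contrast direction.

    gamma > 0 pushes toward the positive aspect, gamma < 0 toward the
    negative one. A constant gamma stands in for the dynamic factor.
    """
    edited = acts.copy()
    edited[:, idx] += gamma * diff[idx]
    return edited
```

In a real model the activation arrays would come from hooks on the feed-forward layers, and the edit would be applied during the forward pass rather than to a stored array. Comparing against an equally sized set of randomly chosen neurons, as the listed figures do, is the natural control for such an edit.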

Paper Structure

This paper contains 25 sections, 7 equations, 16 figures, 6 tables.

Figures (16)

  • Figure 1: ValueInsight Construction and Usage
  • Figure 2: Mainstream process of ValueLocate
  • Figure 3: Results of positively and negatively editing the neurons identified by ValueLocate, as well as editing randomly selected neurons, on Llama-3.1-8B and Gemma-2-9B.
  • Figure 4: Llama-3.1-8B Neuron Distribution
  • Figure 5: Impact of Value-Related Neuron and Random Neuron Manipulation on Llama-3.1-8B
  • ...and 11 more figures