Are the Values of LLMs Structurally Aligned with Humans? A Causal Perspective

Yipeng Kang, Junqi Wang, Yexin Li, Mengmeng Wang, Wenming Tu, Quansen Wang, Hengli Li, Tingjun Wu, Xue Feng, Fangwei Zhong, Zilong Zheng

TL;DR

This paper investigates whether the value dimensions of LLMs are structurally aligned with human values, and posits a latent causal value graph to model these dimensions. It shows that, even after alignment training, this graph differs from human value systems. The authors use the graph to guide two lightweight steering methods, role-based prompting and sparse autoencoder (SAE) steering, so that multiple values can be controlled while side effects on other values are anticipated. SAE steering offers finer-grained, more targeted control than role prompts, and experiments on Gemma-2B-IT and Llama3-8B-IT demonstrate effective and controllable value modulation guided by the causal graph. The work provides a practical framework for value alignment in LLMs and highlights the remaining gap between machine and human value structures, along with limitations and directions for future research.

Abstract

As large language models (LLMs) become increasingly integrated into critical applications, aligning their behavior with human values presents significant challenges. Current methods, such as Reinforcement Learning from Human Feedback (RLHF), typically focus on a limited set of coarse-grained values and are resource-intensive. Moreover, the correlations between these values remain implicit, leading to unclear explanations for value-steering outcomes. Our work argues that a latent causal value graph underlies the value dimensions of LLMs and that, despite alignment training, this structure remains significantly different from human value systems. We leverage these causal value graphs to guide two lightweight value-steering methods: role-based prompting and sparse autoencoder (SAE) steering, effectively mitigating unexpected side effects. Furthermore, SAE provides a more fine-grained approach to value steering. Experiments on Gemma-2B-IT and Llama3-8B-IT demonstrate the effectiveness and controllability of our methods.
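
To make the idea of SAE steering concrete, the following is a minimal sketch of one common form of the intervention: adding a scaled SAE decoder direction to a token's hidden state. The abstract does not specify the exact operation, so the function name `steer_activation`, the toy vectors, and the `strength` parameter are all illustrative assumptions rather than the authors' implementation.

```python
def steer_activation(hidden, decoder_direction, strength):
    """Add a scaled SAE decoder direction to one token's hidden state.

    Assumption: steering is modeled as h' = h + strength * d, where d is the
    decoder vector of the SAE feature tied to a value dimension.
    """
    if len(hidden) != len(decoder_direction):
        raise ValueError("hidden state and decoder direction must have equal length")
    return [h + strength * d for h, d in zip(hidden, decoder_direction)]


# Toy 4-dimensional example; real hidden states have thousands of dimensions.
hidden = [0.5, -1.0, 0.0, 2.0]
direction = [0.0, 1.0, 0.0, 0.0]  # hypothetical decoder row for one feature
steered = steer_activation(hidden, direction, strength=3.0)
```

A positive `strength` pushes the hidden state toward the feature and a negative one pushes it away. In this framing, the causal value graph would be consulted to anticipate which other value dimensions are likely to shift as a result.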

Paper Structure (29 sections, 1 equation, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Steering multiple causally related value dimensions in LLMs. When we use prompts or sparse autoencoders to steer certain dimensions of a large model, other values will correspondingly change.
  • Figure 2: A general framework for role playing and SAE value steering. Within the prompt template, we can adjust the role settings (indicated in red) or directly manipulate the SAE features of specific tokens (indicated in yellow). To guide the LLMs to answer questions in a chain-of-thought (CoT) manner, we provide two in-context examples (indicated in green). Finally, we input a specific question regarding a value, and the LLM outputs both the thought process and the answer. The same steering direction on a value is reflected across different questions.
  • Figure 3: Our value causal graphs for Gemma-2B-IT (left) and Llama3-8B-IT (right), compared to the reference graph, which is annotated by GPT-4o guided by Schwartz’s Theory. We prune the edges of each graph with the transitive reduction algorithm, which leaves the partial order between any two nodes unchanged.
  • Figure 4: The steering effects of role prompts and SAE on expected and unexpected value dimensions for Gemma-2B-IT (left) and Llama3-8B-IT (right). Our causal graph is discovered from training data, while the reference causal graph is generated by GPT-4o guided by Schwartz’s Theory of Basic Values, as described in the appendix on the reference graph. Note that all tests are conducted on the test set, which uses completely different roles and value questions from those used to build the causal graph.
  • Figure 5: Causal graph generated by Gemma-2B-IT (red), Llama3-8B-IT (orange) and ValueBench upper-dimension information (purple).
  • ...and 2 more figures
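
The transitive reduction mentioned in the Figure 3 caption can be sketched as follows. This is a minimal illustration that assumes the graph is acyclic. The function name and the toy nodes `A`, `B`, `C` are hypothetical and are not taken from the paper's code or results.

```python
def transitive_reduction(nodes, edges):
    """Return the smallest edge set of a DAG with the same reachability."""
    succ = {n: set() for n in nodes}
    for u, v in edges:
        succ[u].add(v)

    def reachable(start):
        # Nodes reachable from `start` via one or more edges (iterative DFS).
        seen, stack = set(), list(succ[start])
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(succ[n])
        return seen

    reach = {n: reachable(n) for n in nodes}
    reduced = set()
    for u, v in edges:
        # Edge u->v is redundant if another successor w of u already reaches v.
        if not any(v in reach[w] for w in succ[u] if w != v):
            reduced.add((u, v))
    return reduced


# Toy graph: the direct edge A->C is implied by the path A->B->C.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C")]
reduced_edges = transitive_reduction(["A", "B", "C"], toy_edges)
```

Here the direct edge from `A` to `C` is dropped because the path through `B` already implies it, so every ordering relation between nodes is preserved with fewer drawn edges.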