Exploring and steering the moral compass of Large Language Models

Alejandro Tlaie

TL;DR

The paper probes the ethical alignment of eight LLMs by evaluating them on classical ethical dilemmas and the Moral Foundations Questionnaire (MFQ). It then introduces SARA, a causal intervention method that steers a model's reasoning by adjusting its activation patterns via SVD, at the prompt level and without retraining. Proprietary, closed models lean toward utilitarian ethics, open-weight models align with values-based ethics, and the MFQ reveals a liberal bias in every model except Llama-2, highlighting entrenched biases in deployed LLMs. SARA produces controllable shifts in the direction of moral reasoning (e.g., Kantian vs. utilitarian) without changing the final decision, offering a scalable safety and alignment tool and emphasizing the socio-technical nature of AI ethics.

Abstract

Large Language Models (LLMs) have become central to advancing automation and decision-making across various sectors, raising significant ethical questions. This study proposes a comprehensive comparative analysis of the most advanced LLMs to assess their moral profiles. We subjected several state-of-the-art models to a selection of ethical dilemmas and found that all the proprietary ones are mostly utilitarian and all of the open-weights ones align mostly with values-based ethics. Furthermore, when using the Moral Foundations Questionnaire, all models we probed - except for Llama 2-7B - displayed a strong liberal bias. Lastly, in order to causally intervene in one of the studied models, we propose a novel similarity-specific activation steering technique. Using this method, we were able to reliably steer the model's moral compass to different ethical schools. All of these results showcase that there is an ethical dimension in already deployed LLMs, an aspect that is generally overlooked.
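To make the idea of similarity-based steering more concrete, the following NumPy sketch shows one possible reading of it. It is not the paper's implementation: the function names (`reduce_with_svd`, `rowwise_cosine`, `sara_like_steer`), the per-neuron cosine similarity, the sign convention, and the update rule `1 + sim_attract - sim_repel` are all assumptions made for illustration. Activations are assumed to be matrices of shape (neurons x tokens), and SVD is used only to bring prompts of different lengths to a common size.

```python
import numpy as np


def reduce_with_svd(acts: np.ndarray, k: int) -> np.ndarray:
    """Map a (neurons x tokens) matrix to (neurons x k) via its top-k singular directions."""
    u, s, _ = np.linalg.svd(acts, full_matrices=False)
    if k > s.shape[0]:
        raise ValueError("k must not exceed min(neurons, tokens)")
    u = u[:, :k]
    # Singular vectors are defined only up to sign; fix a convention (assumption)
    # so that reductions of different prompts can be compared.
    pivot = np.argmax(np.abs(u), axis=0)
    signs = np.sign(u[pivot, np.arange(k)])
    signs[signs == 0] = 1.0
    return (u * signs) * s[:k]


def rowwise_cosine(a: np.ndarray, b: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Cosine similarity between matching rows (neurons) of two matrices."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return num / den


def sara_like_steer(acts, attract, repel, k=4):
    """Rescale each neuron's activations: up where it resembles the 'attract'
    prompt, down where it resembles the 'repel' prompt (hypothetical rule)."""
    a_hat = reduce_with_svd(acts, k)
    sim_attract = rowwise_cosine(a_hat, reduce_with_svd(attract, k))
    sim_repel = rowwise_cosine(a_hat, reduce_with_svd(repel, k))
    scale = 1.0 + sim_attract - sim_repel
    return acts * scale[:, None]


# Toy activations for three prompts with different token counts.
rng = np.random.default_rng(0)
prompt_acts = rng.normal(size=(16, 5))
attract_acts = rng.normal(size=(16, 7))
repel_acts = rng.normal(size=(16, 6))
steered = sara_like_steer(prompt_acts, attract_acts, repel_acts, k=3)
print(steered.shape)  # (16, 5)
```

In this sketch the steered matrix keeps the shape of the original activations, so it could in principle be written back into the same layer. When the attraction and repulsion prompts coincide, the two similarity terms cancel and the activations are left unchanged.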

Paper Structure (3 sections, 6 figures)

Ethical dilemmas
Moral profiles
SARA: Similarity-based Activation Steering with Repulsion and Attraction

Figures (6)

Figure 1: Ethical dilemmas as a probe for LLM moral reasoning. A) Ethical alignment with different human traditions. All models have a general tendency towards utilitarianism. The most balanced model is Claude-3-Sonnet. B) Alignment, split by model type. Open models are significantly more deontological and closed LLMs are more similar to utilitarian viewpoints. C) Classification agreement. We measured inter-scorer agreement between the two classifiers we used (GPT-4-Turbo-2024-04-09 and Claude 3 Opus) via the Adjusted Mutual Information. Rectangles show the $1^{st}$ and $99^{th}$ percentiles of the corresponding surrogate distribution. D) Ethical consistency. Response consistency is in general low for all models ($<60\%$). The least reliable models are Claude-3-Sonnet and Llama-2. Vertical lines indicate $90\%$ confidence intervals.
Figure 2: Moral profiles for all models. All models are heavily liberal-biased, except for Llama-2, which is more aligned with conservative values; the most liberally-biased one is Claude-3-Sonnet; the one best representing the average US citizen is GPT-4. In general, all models, except for Llama-2, align with the moral schema of a young Western liberal with a high level of education, engaged in social causes, and with a great openness to experience, empathy, and compassion.
Figure 3: A) Schematic of how SARA works. For details, see Methods - Activation Steering: SARA. B) Example responses: unsteered (gray), Utilitarian-steered (orange) and Kantian-steered (blue). The decision (reporting the parent) is the same; only the reasoning changes. C) How different responses are classified (using the same method as in the previous section) under each steering condition. As can be seen, SARA is effective at steering model responses in different conceptual directions. D) SARA is more effective when intervening at early or late layers than at intermediate ones.
Figure S1: Comparison with another steering method. SARA (more saturated colors) steers responses in a more pronounced way than Activation Addition (ActAdd, a similar steering method proposed by Turner et al., 2023). We also report that SARA has a smaller spillover steering effect than ActAdd, meaning that ActAdd introduces a larger unwanted modification towards non-target directions.
Figure S2: Dissected variable model response to ethical dilemmas. A) Transition graph for proprietary models. There are three main absorbing responses (meaning high self-transition probability): Virtue Ethics, Rule Utilitarianism and Act Utilitarianism. On the other hand, Ethical Altruism, Theory of Rights and Prima Facie Duties are bridging states (very low self-transition probability). B) Covariance matrix for proprietary models.
...and 1 more figure
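The distinction in Figure S2 between absorbing and bridging responses rests on self-transition probabilities. As a minimal sketch of how such probabilities could be estimated, the code below counts transitions between consecutive labels in a sequence of classified responses. It assumes, hypothetically, that each sequence holds the ethical-school labels assigned to repeated answers to the same dilemma; the function names and the toy sequence are illustrative and are not taken from the paper.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate P(next label | current label) from lists of consecutive labels."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for state, row in counts.items()
    }


def self_transition(probs):
    """Probability of staying in the same label; high values suggest an absorbing response."""
    return {state: row.get(state, 0.0) for state, row in probs.items()}


# Toy data (not from the paper): labels of five repeated answers to one dilemma.
runs = [[
    "Virtue Ethics", "Virtue Ethics", "Virtue Ethics",
    "Ethical Altruism", "Virtue Ethics",
]]
probs = transition_probabilities(runs)
print(self_transition(probs))
```

In this toy sequence, "Virtue Ethics" mostly transitions to itself, while "Ethical Altruism" always leads elsewhere, which mirrors the absorbing versus bridging pattern described in the caption.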
