Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis

Guangliang Liu; Haitao Mao; Jiliang Tang; Kristen Marie Johnson

TL;DR

We argue that self-correction helps LLMs find a shortcut to more morally correct output rather than truly reducing the immorality stored in hidden states.

Abstract

Large Language Models (LLMs) are capable of producing content that perpetuates stereotypes, discrimination, and toxicity. The recently proposed moral self-correction is a computationally efficient method for reducing harmful content in the responses of LLMs. However, how injecting self-correction instructions modifies the behavior of LLMs remains under-explored. In this paper, we explore the effectiveness of moral self-correction by answering three research questions: (1) In what scenarios does moral self-correction work? (2) What internal mechanisms of LLMs, e.g., hidden states, are influenced by moral self-correction instructions? (3) Is intrinsic moral self-correction actually superficial in terms of reduced immorality in hidden states? We argue that self-correction can help LLMs find a shortcut to more morally correct output, rather than truly reducing the immorality stored in hidden states. Through empirical investigation with language generation and multi-choice question answering tasks, we conclude: (i) LLMs exhibit good performance across both tasks, and self-correction instructions are particularly beneficial when the correct answer is already top-ranked; (ii) the morality levels in intermediate hidden states are strong indicators of whether one instruction would be more effective than another; (iii) based on our analysis of intermediate hidden states and task case studies of self-correction behaviors, we are the first to propose the hypothesis that intrinsic moral self-correction is in fact superficial.

Paper Structure (21 sections, 8 figures, 2 tables)

Introduction
Related Works
Scenarios for Moral Self-Correction
Experimental Settings
Experimental Results
Mechanisms of Moral Self-Correction
Experimental Settings
Effects of Layer-wise Hidden States
Effects of Attention and FFLs
Effectiveness of Instructions
Superficial Hypothesis
Discussion
Conclusion
Future Works
Appendix
...and 6 more sections

Figures (8)

Figure 1: Moral Self-correction Performance Evaluated using BBQ, Winogender, and RealToxicity Benchmarks. For the BBQ and Winogender benchmarks, the self-correction process was applied iteratively five times. The fairness score, indicative of reduced bias in the models' outputs, was reported for these benchmarks. Notably, higher fairness scores correspond to lower levels of bias. Conversely, for the RealToxicity benchmark, the evaluation metric was the toxicity score, with lower scores indicating better performance in reducing toxic outputs. More results for BBQ are available in Appendix \ref{fig:addMainResult4BBQ}.
Figure 2: Results of Probing Experiments for the RealToxicity and Winogender Benchmarks and the Age Bias Dimension of BBQ. The x-axis indicates the layer index. For each benchmark, the average similarity of layer-wise hidden states to the probing vector is reported; lower scores are better. The Baseline represents the performance without self-correction instructions. For clarity, we present the results for rounds 1, 3, and 5 of BBQ and Winogender, and rounds 1, 3, 5, and 7 of RealToxicity. Additional results are available in Appendix \ref{fig:addinternal}.
Figure 3: Average Similarity Across Self-correction Rounds, with an Emphasis on Attention Heads and Feed-forward Layers. For RealToxicity, we consider layers 23 through the final layer, while for Winogender and BBQ, we analyze layers 15 through 28. For attention heads, we take the output of the output-projection module (e.g., model.layers.0.self_attn.o_proj); for feed-forward layers, we take the output of the down-projection module (e.g., model.layers.0.mlp.down_proj). Additional results for other social bias dimensions of BBQ are available in Appendix \ref{fig:additionalResults4AttMLP}.
Figure 4: Self-correction Instructions Across Various Specificity Levels. We show the similarity of layer-wise hidden states to bias for each instruction. The performance of these instructions is 0.633 for specificity-0, 0.642 for specificity-1, and 1.00 for specificity-2, which directly injects the ground-truth label.
Figure 5: Main Result for Self-correction Performance Over the Disability and Physical Bias Dimensions.
...and 3 more figures
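
The quantity plotted in Figure 2, the average similarity of layer-wise hidden states to a probing vector, can be illustrated with a minimal sketch. The captions do not specify the similarity measure or how the probing vectors are obtained, so this sketch assumes cosine similarity and one probing vector per layer; the names cosine_similarity, layerwise_similarity, hidden_states, and probes are hypothetical and the numbers are toy values, not results from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def layerwise_similarity(hidden_states, probes):
    """Mean similarity to the probe, per layer.

    hidden_states[layer][example] is one hidden-state vector;
    probes[layer] is the (assumed) probing vector for that layer.
    """
    scores = []
    for layer_states, probe in zip(hidden_states, probes):
        sims = [cosine_similarity(h, probe) for h in layer_states]
        scores.append(sum(sims) / len(sims))
    return scores


# Toy data: 2 layers, 2 examples per layer, 2-dimensional hidden states.
probes = [[1.0, 0.0], [1.0, 0.0]]
hidden_states = [
    [[1.0, 0.0], [0.0, 1.0]],  # layer 0: one aligned, one orthogonal
    [[0.0, 1.0], [0.0, 2.0]],  # layer 1: both orthogonal to the probe
]
scores = layerwise_similarity(hidden_states, probes)
print(scores)
```

Under these assumptions, a lower per-layer score means the hidden states at that layer are less aligned with the probed direction (e.g., toxicity or bias), which is the sense in which the Figure 2 caption says lower is better.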
