Discourse Heuristics For Paradoxically Moral Self-Correction

Guangliang Liu, Zimo Qi, Xitong Zhang, Kristen Marie Johnson

TL;DR

This work investigates two paradoxes in moral self-correction for LLMs: that corrections tend to be superficial and that self-diagnosis does not reliably drive effective correction. By analyzing discourse constructions in fine-tuning data, it uncovers shallow heuristics that enable self-correction and demonstrates that a general discourse approach yields inconsistent improvements across self-correction and self-diagnosis. The study shows that context and action-oriented discourse can drive self-correction even without explicit stereotype awareness, with model size influencing generalization; combining multiple stereotypes via mixed fine-tuning enhances generalization in larger models but can introduce conflicts with self-diagnosis. The authors propose leveraging these heuristics to improve moral self-correction while acknowledging generalization challenges and outlining future work, including extending the framework to other tasks and incorporating external feedback to mitigate reliance on shallow heuristics.

Abstract

Moral self-correction has emerged as a promising approach for aligning the output of Large Language Models (LLMs) with human moral values. However, moral self-correction techniques are subject to two primary paradoxes. First, despite empirical and theoretical evidence to support the effectiveness of self-correction, this LLM capability only operates at a superficial level. Second, while LLMs possess the capability of self-diagnosing immoral aspects of their output, they struggle to identify the cause of this moral inconsistency during their self-correction process. To better understand and address these paradoxes, we analyze the discourse constructions in fine-tuning corpora designed to enhance moral self-correction, uncovering the existence of the heuristics underlying effective constructions. We demonstrate that moral self-correction relies on discourse constructions that reflect heuristic shortcuts, and that the presence of these heuristic shortcuts during self-correction leads to inconsistency when attempting to enhance both self-correction and self-diagnosis capabilities jointly. Based on our findings, we propose a solution to improve moral self-correction by leveraging the heuristics of curated datasets. We also highlight the generalization challenges of this capability, particularly in terms of learning from situated context and model scales.

Paper Structure

This paper contains 20 sections, 1 figure, 12 tables.

Figures (1)

  • Figure 1: Constructions in the Fine-tuning Discourse (top) and Self-correction Prompt (bottom). Each component in the discourse is aligned with its counterpart in the task prompt. Note that the Action component contains a sub-action, which aligns with the self-correction instruction in the self-correction prompt. This is intended to elicit an Action that instructs how to avoid stereotypes when making choices.