Integrating Emotional and Linguistic Models for Ethical Compliance in Large Language Models

Edward Y. Chang

TL;DR

The paper addresses biases and reward-hacking risks in RLHF-based alignment by introducing $\mathsf{DIKE}$, a decoupled behavioral oversight layer for Large Language Models that separates behavior from knowledge. It combines Diagnostics, Interpretation, Knowledge-independent learning, and Ethical guardrails, with an adversarial module $\mathsf{ERIS}$ to enable culture-aware governance and transparent oversight. A quantitative emotional model and self-supervised emotion–behavior mappings underpin behavior rectification, while adversarial in-context reviews balance ethics with free speech across cultures. Pilot studies using love-letter corpora demonstrate improved emotion-behavior classification and effective checks-and-balances for rectifying outputs, signaling a path toward more accountable, culturally sensitive AI interactions. The work highlights practical impact for content moderation, mental health support, and cross-cultural AI ethics, while outlining clear directions for expanding emotional granularity and validating guardrails across diverse contexts.

Abstract

This research develops advanced methodologies for Large Language Models (LLMs) to better manage linguistic behaviors related to emotions and ethics. We introduce DIKE, an adversarial framework that enhances the LLMs' ability to internalize and reflect global human values, adapting to varied cultural contexts to promote transparency and trust among users. The methodology involves detailed modeling of emotions, classification of linguistic behaviors, and implementation of ethical guardrails. Our innovative approaches include mapping emotions and behaviors using self-supervised learning techniques, refining these guardrails through adversarial reviews, and systematically adjusting outputs to ensure ethical alignment. This framework establishes a robust foundation for AI systems to operate with ethical integrity and cultural sensitivity, paving the way for more responsible and context-aware AI interactions.

Paper Structure

This paper contains 15 sections, 6 figures, and 10 tables.

Introduction
Related Work
Emotion and Emotion-Behavior Modeling
Reinforcement Learning with Human/AI Feedback, RLHF vs. RLAIF
Challenges and Theoretical Considerations
Quantitative Models of Emotions, Behaviors, and Ethics
Development of a Quantitative Emotional Model
Development of Cognitive Frameworks to Regulate Linguistic Behaviors
Adversarial In-Context Review to Balance Ethics and Free Speech
Pilot Studies
Emotion Layer Evaluation
Behavior Classification
Adversarial Evaluation and Rectification
Conclusion
Appendix H: Instruction to Human Annotators

Figures (6)

Figure 1: Emotion distributions in behaviors
Figure 2: Classification accuracy and entropy
Figure 3: SocraSynth Agents and Roles
Figure 4: Changes in arguments of GPT-4 at different contentiousness levels
Figure 5: Comparative display of emotional models. These models include only the “basic” emotions. Complex emotions can be modeled with basic emotions.
...and 1 more figure
