Integrating Emotional and Linguistic Models for Ethical Compliance in Large Language Models
Edward Y. Chang
TL;DR
The paper addresses biases and reward-hacking risks in RLHF-based alignment by introducing $\mathsf{DIKE}$, a decoupled behavioral oversight layer for Large Language Models that separates behavior from knowledge. It combines Diagnostics, Interpretation, Knowledge-independent learning, and Ethical guardrails with an adversarial module, $\mathsf{ERIS}$, to enable culture-aware governance and transparent oversight. A quantitative emotional model and self-supervised emotion–behavior mappings underpin behavior rectification, while adversarial in-context reviews balance ethics with free speech across cultures. Pilot studies using love-letter corpora demonstrate improved emotion–behavior classification and effective checks and balances for rectifying outputs, pointing toward more accountable, culturally sensitive AI interactions. The work highlights practical impact for content moderation, mental health support, and cross-cultural AI ethics, and outlines directions for expanding emotional granularity and validating guardrails across diverse contexts.
Abstract
This research develops methods that help Large Language Models (LLMs) manage linguistic behaviors related to emotions and ethics. We introduce DIKE, an adversarial framework that enhances an LLM's ability to internalize and reflect global human values and to adapt to varied cultural contexts, promoting transparency and trust among users. The methodology comprises three parts: detailed modeling of emotions, classification of linguistic behaviors, and implementation of ethical guardrails. Specifically, we map emotions to behaviors using self-supervised learning, refine the guardrails through adversarial reviews, and systematically adjust outputs to ensure ethical alignment. The framework provides a foundation for AI systems to operate with ethical integrity and cultural sensitivity, supporting more responsible and context-aware AI interactions.
