Denevil: Towards Deciphering and Navigating the Ethical Values of Large Language Models via Instruction Learning

Shitong Duan, Xiaoyuan Yi, Peng Zhang, Tun Lu, Xing Xie, Ning Gu

TL;DR

This paper proposes DeNEVIL, a novel prompt generation algorithm that dynamically exploits LLMs' value vulnerabilities and elicits ethical violations in a generative manner, revealing the models' underlying value inclinations.

Abstract

Large Language Models (LLMs) have made unprecedented breakthroughs, yet their increasing integration into everyday life might raise societal risks due to generated unethical content. Despite extensive study on specific issues like bias, the intrinsic values of LLMs remain largely unexplored from a moral philosophy perspective. This work delves into ethical values utilizing Moral Foundation Theory. Moving beyond conventional discriminative evaluations with poor reliability, we propose DeNEVIL, a novel prompt generation algorithm tailored to dynamically exploit LLMs' value vulnerabilities and elicit the violation of ethics in a generative manner, revealing their underlying value inclinations. On such a basis, we construct MoralPrompt, a high-quality dataset comprising 2,397 prompts covering 500+ value principles, and then benchmark the intrinsic values across a spectrum of LLMs. We discovered that most models are essentially misaligned, necessitating further ethical value alignment. In response, we develop VILMO, an in-context alignment method that substantially enhances the value compliance of LLM outputs by learning to generate appropriate value instructions, outperforming existing competitors. Our methods are suitable for black-box and open-source models, offering a promising initial step in studying the ethical values of LLMs.

Paper Structure (48 sections, 21 equations, 5 figures, 27 tables, 2 algorithms)

Figures (5)

  • Figure 1: (a) Examples of discriminative and generative evaluations. (b) Illustration of our generative evaluation framework, DeNEVIL. (c) Depiction of our in-context alignment method, VILMO.
  • Figure 2: Ethical value evaluation results. The higher the EVR, APV, and MVP, the greater the extent to which the LLM violates values. We assess both open-source and OpenAI black-box LLMs, and report results averaged on all foundations. See the appendix (additional results and analyses) for separate results on each foundation.
  • Figure 3: (a) The comparison of discriminative and generative evaluations on LLaMA-70B, LLaMA-70B-Chat, Text-Davinci-003, and ChatGPT. (b) Evaluation results (APV) using moral prompts constructed through ChatGPT and Vicuna-33B, respectively. (c) Value violation of LLaMA and ChatGPT using prompts produced by themselves with varying DeNEVIL iteration rounds.
  • Figure 4: (a) Human evaluation results. Krippendorff's Alpha of 0.82 indicates acceptable inter-annotator agreement. (b) Trade-off curve of value violation (APV) and completion diversity of ChatGPT over the number of iterative augmentation rounds of VILMO. (c) A similar trade-off curve of value violation and completion coherence. (d) Samples of ChatGPT aligned by different models. The words that express violation and conformity are marked in red and green, respectively.
  • Figure : The DeNEVIL Framework