Eagle: Ethical Dataset Given from Real Interactions

Masahiro Kaneko, Danushka Bollegala, Timothy Baldwin

TL;DR

The paper addresses the gap between synthetic ethical benchmarks and real user prompts by introducing Eagle, a dataset extracted from real ChatGPT-user conversations. It formalizes a likelihood-based unethical score (LLS) to quantify a model's propensity for unethical outputs and shows that Eagle correlates poorly with existing datasets, which suggests that prior benchmarks miss what Eagle captures. In few-shot mitigation experiments, Eagle-based prompts reduce unethical outputs more effectively than prompts built from existing datasets, highlighting the value of real-use data for LLM safety. Limitations include restricted language coverage and reliance on data from a single service, but the authors publicly release code to support broader evaluation and benchmarking.

Abstract

Recent studies have demonstrated that large language models (LLMs) have ethics-related problems such as social biases, lack of moral reasoning, and generation of offensive content. Existing evaluation metrics and mitigation methods for these ethical challenges rely on datasets created by instructing humans to write instances that contain ethical problems. Such data therefore does not reflect the prompts that users actually provide when using LLM services in everyday contexts, and may not lead to safe LLMs that can handle the ethical challenges arising in real-world applications. In this paper, we create the Eagle datasets, extracted from real interactions between ChatGPT and users that exhibit social biases, toxicity, and immorality. Our experiments show that Eagle captures complementary aspects not covered by existing datasets proposed for evaluating and mitigating such ethical challenges. Our code is publicly available at https://huggingface.co/datasets/MasahiroKaneko/eagle.

Paper Structure

This paper contains 18 sections, 2 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: The creation process for the Eagle dataset. The Eagle dataset contains actual ChatGPT-user interactions.
  • Figure 2: LLS (on the $y$-axis) shown against the number of examples used for few-shot learning (on the $x$-axis). Higher LLS values indicate a stronger tendency to generate unethical text; this tendency decreases as the number of few-shot mitigation examples increases.
  • Figure 3: LLS (on the $y$-axis) shown against the number of examples used for few-shot learning (on the $x$-axis). Lower LLS indicates that the few-shot examples have a greater effect in shifting the model's generative tendencies away from its original output on neutral instances.
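
The figure captions plot LLS against the number of few-shot examples, but this page does not reproduce the formula for LLS. As an illustration only, the sketch below assumes one plausible reading: the score is the mean log-likelihood a model assigns to a set of responses given their prompts. The function names and the toy per-token log-probabilities are hypothetical, and the paper's actual definition may differ (for example in normalization).

```python
from statistics import mean
from typing import Sequence


def sequence_log_likelihood(token_logprobs: Sequence[float]) -> float:
    """Sum per-token log-probabilities into one sequence log-likelihood."""
    return sum(token_logprobs)


def likelihood_score(per_instance_token_logprobs: Sequence[Sequence[float]]) -> float:
    """Average sequence log-likelihood over all instances.

    ASSUMPTION: this averaging form is an illustrative stand-in for LLS,
    not the definition given in the paper.
    """
    if not per_instance_token_logprobs:
        raise ValueError("need at least one instance")
    return mean(sequence_log_likelihood(lp) for lp in per_instance_token_logprobs)


# Toy, made-up log-probabilities of unethical responses under two conditions.
zero_shot = [[-0.5, -0.5], [-1.0, -1.0]]   # no mitigation examples in the prompt
few_shot = [[-1.0, -1.0], [-2.0, -2.0]]    # with mitigation examples prepended

# A lower score after mitigation means the unethical responses became less likely.
score_drop = likelihood_score(zero_shot) - likelihood_score(few_shot)
```

Under this reading, a curve like the one described for Figure 2 would come from recomputing the score as more mitigation examples are prepended to each prompt.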