
CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language Models

Linhao Yu, Yongqi Leng, Yufei Huang, Shang Wu, Haixin Liu, Xinmeng Ji, Jiahui Zhao, Jinwang Song, Tingting Cui, Xiaoqing Cheng, Tao Liu, Deyi Xiong

TL;DR

This work addresses the scarcity of culturally aligned Chinese morality benchmarks for evaluating LLMs. It introduces CMoralEval, a large-scale dataset (~30k entries) built from two data sources (a moral TV program and published moral anomies), organized into a five-category taxonomy and five fundamental principles, with narrator and RoT annotations. The dataset covers two scenario types (explicit moral and moral dilemma) and relies on AI-assisted annotation for scalable generation and quality control. Evaluations span 26 Chinese LLMs in zero-shot and few-shot settings. The findings indicate substantial room for improvement in the moral reasoning of current Chinese LLMs, although larger models such as Yi-34B-Chat perform better in certain categories, which underscores the benchmark's utility for guiding alignment research and future dataset enhancement.

Abstract

How would a large language model (LLM) respond in an ethically relevant context? In this paper, we curate a large benchmark, CMoralEval, for morality evaluation of Chinese LLMs. The data sources of CMoralEval are two-fold: 1) a Chinese TV program discussing Chinese moral norms with stories drawn from society and 2) a collection of Chinese moral anomies from various newspapers and academic papers on morality. With these sources, we aim to create a moral evaluation dataset characterized by diversity and authenticity. We develop a morality taxonomy and a set of fundamental moral principles that are not only rooted in traditional Chinese culture but also consistent with contemporary societal norms. To facilitate efficient construction and annotation of instances in CMoralEval, we establish a platform with AI-assisted instance generation to streamline the annotation process. These help us curate CMoralEval, which encompasses both explicit moral scenarios (14,964 instances) and moral dilemma scenarios (15,424 instances), each with instances from different data sources. We conduct extensive experiments with CMoralEval to examine a variety of Chinese LLMs. Experimental results demonstrate that CMoralEval is a challenging benchmark for Chinese LLMs. The dataset is publicly available at https://github.com/tjunlp-lab/CMoralEval.

Paper Structure (23 sections, 9 figures, 9 tables)

Figures (9)

  • Figure 1: The overall pipeline for collecting questions in CMoralEval. Scene denotes an objective description of an event; Narrator encompasses the various characters involved in the event; RoT (rule of thumb) refers to a descriptive cultural norm structured as the judgment of an action (Forbes et al., 2020). Each narrator corresponds to a specific RoT, and this pairing is referred to as a Narrator-RoT pair. Contravening Reasons are legitimate justifications that may be perceived as contradicting the “RoTs”. A Narrator-RoT pair is used for Generating Options, a step that uses ChatGPT for assistance. Text highlighted in yellow marks the different narrators in the basic scene; text highlighted in grey marks a contravening reason in the new scene. The detailed generation process is described in the appendix “Generating different scenarios”.
  • Figure 2: Five-shot results on the various subdivisions of CMoralEval. EMS_1: Explicit moral scenarios from TV programs; EMS_2: Explicit moral scenarios from collected moral anomies; MDS_1: Moral dilemma scenarios from TV programs; MDS_2: Moral dilemma scenarios from collected moral anomies; party/standby denote the different narrators; moral/unmoral denote evaluating LLMs by asking them to choose the moral or the immoral option, respectively.
  • Figure 3: Few-shot results across categories of CMoralEval.
  • Figure 4: Few-shot results on CMoralEval for single-category and multi-category questions. “-only” denotes single-category questions; “-mixed” denotes multi-category questions.
  • Figure 5: Few-shot results on CMoralEval with controlled variables. The “_moral_or_not” suffix denotes accuracy under which a question counts as correct only if it is answered correctly both when choosing the appropriate option and when choosing the inappropriate option. The “_party_or_not” suffix denotes accuracy under which a question counts as correct only if it is answered correctly in both the party and the standby setting.
  • ...and 4 more figures
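
The paired accuracies described in the Figure 5 caption can be read as a strict consistency metric: a question scores only if every one of its variants is answered correctly. The following is a minimal Python sketch under that reading. The record layout (`question_id`, `variant`, `correct`) and the function name `paired_accuracy` are hypothetical and are not taken from the paper or its repository, and the toy data is made up for illustration.

```python
from collections import defaultdict


def paired_accuracy(records, variants):
    """Fraction of questions answered correctly under every listed variant.

    records:  iterable of dicts with hypothetical keys
              'question_id', 'variant', and 'correct' (bool).
    variants: labels that must all be correct for a question to score,
              e.g. ('moral', 'unmoral') or ('party', 'standby').
    """
    # Group per-variant correctness by question.
    by_question = defaultdict(dict)
    for r in records:
        by_question[r["question_id"]][r["variant"]] = bool(r["correct"])

    # Score only questions that were evaluated under every variant.
    complete = [v for v in by_question.values()
                if all(k in v for k in variants)]
    if not complete:
        return 0.0

    # A question is a hit only if all of its variants are correct.
    hits = sum(all(v[k] for k in variants) for v in complete)
    return hits / len(complete)


# Toy data (illustrative only, not results from the paper).
toy = [
    {"question_id": 1, "variant": "moral", "correct": True},
    {"question_id": 1, "variant": "unmoral", "correct": True},
    {"question_id": 2, "variant": "moral", "correct": True},
    {"question_id": 2, "variant": "unmoral", "correct": False},
]
score = paired_accuracy(toy, ("moral", "unmoral"))
print(score)
```

Under this reading, the paired score can never exceed the lower of the two single-variant accuracies, so a gap between them would reflect inconsistent answers rather than uniformly weak ones.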