OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety

Chuang Liu, Linhao Yu, Jiaxuan Li, Renren Jin, Yufei Huang, Ling Shi, Junhui Zhang, Xinmeng Ji, Tingting Cui, Tao Liu, Jinwang Song, Hongying Zan, Sun Li, Deyi Xiong

TL;DR

OpenEval introduces a multi-dimensional evaluation framework for Chinese LLMs that jointly assesses capability, alignment, and safety across 53 tasks drawn from 25 datasets, totaling roughly 300K questions. It establishes a dynamic benchmark strategy and transparent leaderboards to keep evaluations aligned with rapid model development, demonstrated by a first public run with 9 open-source and 5 proprietary LLMs. Key findings show proprietary models excel in disciplinary knowledge and mathematical reasoning but lag in alignment and safety, while open-source LLMs display varying strengths in NLP tasks and better alignment in some cases, with commonsense reasoning remaining a challenge. The work provides a scalable, adaptable platform that can guide future improvements and monitoring of Chinese LLMs through API, local, and online evaluation modes.

Abstract

The rapid development of Chinese large language models (LLMs) poses significant challenges for efficient LLM evaluation. While current initiatives have introduced new benchmarks or evaluation platforms for assessing Chinese LLMs, many of these focus primarily on capabilities and usually overlook potential alignment and safety issues. To address this gap, we introduce OpenEval, an evaluation testbed that benchmarks Chinese LLMs across capability, alignment and safety. For capability assessment, we include 12 benchmark datasets to evaluate Chinese LLMs along 4 sub-dimensions: NLP tasks, disciplinary knowledge, commonsense reasoning and mathematical reasoning. For alignment assessment, OpenEval contains 7 datasets that examine bias, offensiveness and illegality in the outputs of Chinese LLMs. To evaluate safety, especially the anticipated risks of advanced LLMs (e.g., power-seeking, self-awareness), we include 6 datasets. In addition to these benchmarks, we have implemented a phased public evaluation and benchmark update strategy to ensure that OpenEval keeps pace with the development of Chinese LLMs and can even provide cutting-edge benchmark datasets to guide that development. In our first public evaluation, we tested a range of Chinese LLMs, spanning from 7B to 72B parameters and including both open-source and proprietary models. Evaluation results indicate that while Chinese LLMs have shown impressive performance in certain tasks, more attention should be directed towards broader aspects such as commonsense reasoning, alignment, and safety.
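
As a rough illustration of the taxonomy described in the abstract, the three dimensions and their dataset counts can be written down as a small data structure. This is only a sketch: the `Dimension` class, the `TAXONOMY` tuple, and the `total_datasets` helper are hypothetical names chosen for this example and are not part of OpenEval. The counts (12, 7, 6) and the focus areas come from the abstract; the two safety entries are the examples the abstract names, not a complete list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    """One top-level evaluation dimension (hypothetical structure)."""
    name: str
    num_datasets: int
    focus: tuple[str, ...]


# Counts and focus areas as stated in the abstract.
TAXONOMY = (
    Dimension(
        "capability",
        12,
        (
            "NLP tasks",
            "disciplinary knowledge",
            "commonsense reasoning",
            "mathematical reasoning",
        ),
    ),
    Dimension("alignment", 7, ("bias", "offensiveness", "illegality")),
    # The abstract gives these two only as examples of anticipated risks.
    Dimension("safety", 6, ("power-seeking", "self-awareness")),
)


def total_datasets(taxonomy):
    """Sum the dataset counts over all dimensions."""
    return sum(d.num_datasets for d in taxonomy)


if __name__ == "__main__":
    for dim in TAXONOMY:
        print(f"{dim.name}: {dim.num_datasets} datasets")
    print("total:", total_datasets(TAXONOMY))
```

The total of 12 + 7 + 6 agrees with the 25 datasets quoted in the TL;DR.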

Paper Structure (27 sections, 9 figures, 1 table)

Figures (9)

  • Figure 1: Overview of the evaluation taxonomy and used datasets in OpenEval.
  • Figure 2: Main results in the first public Chinese LLM evaluation with OpenEval.
  • Figure 3: OpenEval provides a user-friendly interface, enabling users to effortlessly conduct comprehensive evaluations of LLMs.
  • Figure 4: Results over the NLP tasks evaluation subdimension.
  • Figure 5: Results of the disciplinary knowledge evaluation subdimension.
  • ...and 4 more figures