LHMKE: A Large-scale Holistic Multi-subject Knowledge Evaluation Benchmark for Chinese Large Language Models

Chuang Liu; Renren Jin; Yuqi Ren; Deyi Xiong

TL;DR

LHMKE is a benchmark for comprehensively evaluating the knowledge acquisition capabilities of Chinese LLMs. It includes both objective and subjective questions, offering a more holistic evaluation of the knowledge level of LLMs.

Abstract

Chinese Large Language Models (LLMs) have recently demonstrated impressive capabilities across various NLP benchmarks and real-world applications. However, existing benchmarks for comprehensively evaluating these LLMs remain insufficient, particularly for measuring the knowledge that LLMs capture. To address this issue, current datasets collect questions from Chinese examinations across different subjects and educational levels. Yet these benchmarks focus primarily on objective questions, such as multiple-choice questions, and therefore lack diversity in question types. To tackle this problem, we propose LHMKE, a Large-scale, Holistic, and Multi-subject Knowledge Evaluation benchmark. LHMKE is designed to provide a comprehensive evaluation of the knowledge acquisition capabilities of Chinese LLMs. It encompasses 10,465 questions across 75 tasks covering 30 subjects, ranging from primary school to professional certification exams. Notably, LHMKE includes both objective and subjective questions, offering a more holistic evaluation of the knowledge level of LLMs. We assess 11 Chinese LLMs under the zero-shot setting, which aligns with real examinations, and compare their performance across different subjects. We also conduct an in-depth analysis of whether GPT-4 can automatically score subjective predictions. Our findings suggest that LHMKE is a challenging and advanced testbed for Chinese LLMs.

Paper Structure (20 sections, 3 figures, 9 tables)

Introduction
Our main contributions in the paper:
Related work
LHMKE
The Group of Elementary and Secondary School
The Group of College
The Group of Career Development
Dataset Statistic
Experiments
Assessed LLMs
Evaluation Metrics
Results
Analysis
Comparing LLM Performance between Objective and Subjective Questions
Analysis for Evaluating Subjective Question
...and 5 more sections

Figures (3)

Figure 1: Examples in LHMKE. Yellow: an objective single-choice question from the Western Medicine subject. Green: an objective multi-choice question from the Psychological Counselor subject. Blue: a subjective writing question from Teacher Certification. Orange: a subjective conditional-analysis question from the Construction Practical Examination.
Figure 2: Main subjects in LHMKE.
Figure 3: Comparing each LLM's performance in objective questions vs. subjective questions.
