
FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models

Wei Li, Ren Ma, Jiang Wu, Chenya Gu, Jiahui Peng, Jinyang Len, Songyang Zhang, Hang Yan, Dahua Lin, Conghui He

TL;DR

FoundaBench addresses the lack of a standardized, culturally relevant benchmark for fundamental knowledge in Chinese LLMs. It proposes a taxonomy-based benchmark of 3354 multiple-choice questions (MCQs) spanning common sense and K-12 content, paired with psychometric quality control and the CircularEval protocol to mitigate response biases. An empirical evaluation of 12 models shows that Chinese-pretrained LLMs outperform English-centric ones and that models exhibit a pronounced gap between memory recall and reasoning capabilities; CircularEval additionally exposes their susceptibility to guessing. The work provides a rigorous baseline and a scalable framework to guide future development of Chinese fundamental knowledge in LLMs.

Abstract

In the burgeoning field of large language models (LLMs), the assessment of fundamental knowledge remains a critical challenge, particularly for models tailored to Chinese language and culture. This paper introduces FoundaBench, a pioneering benchmark designed to rigorously evaluate the fundamental knowledge capabilities of Chinese LLMs. FoundaBench encompasses a diverse array of 3354 multiple-choice questions across common sense and K-12 educational subjects, meticulously curated to reflect the breadth and depth of everyday and academic knowledge. We present an extensive evaluation of 12 state-of-the-art LLMs using FoundaBench, employing both traditional assessment methods and our CircularEval protocol to mitigate potential biases in model responses. Our results highlight the superior performance of models pre-trained on Chinese corpora, and reveal a significant disparity between models' reasoning and memory recall capabilities. The insights gleaned from FoundaBench evaluations set a new standard for understanding the fundamental knowledge of LLMs, providing a robust framework for future advancements in the field.
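
The abstract names CircularEval but does not spell out its mechanics. The sketch below illustrates the protocol as it is commonly described elsewhere: each question is asked once per circular shift of its options, and it counts as correct only if the model picks the right option in every shift. This is an assumption about the protocol, not a description of the authors' implementation. The names `circular_shifts`, `circular_eval`, and the `model` callable (which takes a question and an option list and returns a letter) are hypothetical and used only for illustration.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

LABELS = "ABCDEFGH"  # option letters; assumes at most 8 options per question


def circular_shifts(
    options: Sequence[str], answer_idx: int
) -> Iterator[Tuple[List[str], int]]:
    """Yield every rotation of the options with the answer's new index."""
    n = len(options)
    for k in range(n):
        shifted = list(options[k:]) + list(options[:k])
        # The option originally at index i now sits at (i - k) mod n.
        yield shifted, (answer_idx - k) % n


def circular_eval(
    question: str,
    options: Sequence[str],
    answer_idx: int,
    model: Callable[[str, List[str]], str],
) -> bool:
    """Return True only if the model is correct under every rotation."""
    for shifted, idx in circular_shifts(options, answer_idx):
        if model(question, shifted) != LABELS[idx]:
            return False
    return True


# Hypothetical stand-in models, for illustration only.
def always_a(question: str, options: List[str]) -> str:
    """A position-biased model that always answers 'A'."""
    return "A"


def knows_answer(question: str, options: List[str]) -> str:
    """A model that locates the correct option wherever it appears."""
    return LABELS[options.index("Beijing")]


question = "What is the capital of China?"
options = ["Beijing", "Shanghai", "Guangzhou", "Shenzhen"]

biased_result = circular_eval(question, options, 0, always_a)
robust_result = circular_eval(question, options, 0, knows_answer)
```

In this toy case the always-"A" model would be scored correct under a single-pass evaluation, because the answer happens to sit in position A, but it fails once the options are rotated. Under the simplifying assumption of independent uniform guesses on a four-option question, the chance of passing all four rotations is (1/4)^4, compared with 1/4 for a single pass, which is one way to read the claim that the protocol exposes guessing.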

Paper Structure

This paper contains 16 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Overview diagram of the FoundaBench
  • Figure 2: Improved process for question generation
  • Figure 3: Evaluation of all models
  • Figure 4: Hard example for correspondence reasoning question
  • Figure 5: Hard example for time calculation reasoning question
  • ...and 2 more figures