ScholarBench: A Bilingual Benchmark for Abstraction, Comprehension, and Reasoning Evaluation in Academic Contexts

Dongwon Noh, Donghyeok Koh, Junghun Yuk, Gyuwan Kim, Jaeyong Lee, Kyungtae Lim, Cheoneum Park

TL;DR

ScholarBench addresses the lack of domain-specific evaluation for LLMs with a bilingual benchmark that spans eight academic domains and uses 63 English and 65 Korean attributes to probe abstraction, comprehension, and reasoning. It introduces a three-stage data construction pipeline that sources papers, generates questions with GPT-4o, and refines them via expert review, yielding 5,031 Korean and 5,309 English items across five question types in both closed-book and open-book settings. Experiments on API-based and open-source models show task- and domain-dependent strengths, with average scores around 0.54, underscoring the difficulty of academic-domain reasoning and cross-lingual transfer. The work analyzes abstraction, comprehension, and reasoning separately, discusses bilingual performance patterns, acknowledges English-language limitations, and outlines future directions, including expanding domains and modalities (e.g., RAG, multimodal data) and error-type analysis. The dataset is released under CC BY-ND 4.0 as a resource for scholarly AI benchmarking and domain-tuned model development.

Abstract

Prior benchmarks for evaluating the domain-specific knowledge of large language models (LLMs) lack the scalability to handle complex academic tasks. To address this, we introduce ScholarBench, a benchmark centered on deep expert knowledge and complex academic problem-solving, which evaluates the academic reasoning ability of LLMs and is constructed through a three-step process. ScholarBench targets more specialized and logically complex contexts derived from academic literature, encompassing five distinct problem types. Unlike prior benchmarks, ScholarBench evaluates the abstraction, comprehension, and reasoning capabilities of LLMs across eight distinct research domains. To ensure high-quality evaluation data, we define category-specific example attributes and design questions that are aligned with the characteristic research methodologies and discourse structures of each domain. Additionally, this benchmark operates as an English-Korean bilingual dataset, facilitating simultaneous evaluation of the linguistic capabilities of LLMs in both languages. The benchmark comprises 5,031 examples in Korean and 5,309 in English, with even state-of-the-art models like o3-mini achieving an average evaluation score of only 0.543, demonstrating the challenging nature of this benchmark.
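
As a rough illustration of the closed-book and open-book settings mentioned in the TL;DR, the sketch below builds a prompt with or without the source passage. The record layout, field names, and prompt wording are assumptions made for this example; they are not the released ScholarBench schema or the authors' evaluation code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BenchmarkItem:
    """Hypothetical record layout; field names are illustrative only."""
    language: str                   # "en" or "ko"
    domain: str                     # one of the eight academic domains
    question_type: str              # one of the five question types
    question: str
    context: Optional[str] = None   # source passage, used only in open-book mode


def build_prompt(item: BenchmarkItem, open_book: bool) -> str:
    """Return a prompt for the closed-book or open-book setting.

    Closed-book: the model sees only the question.
    Open-book: the source passage is prepended to the question.
    """
    parts = []
    if open_book:
        if item.context is None:
            raise ValueError("open-book evaluation needs a context passage")
        parts.append(f"Passage:\n{item.context}")
    parts.append(f"Question:\n{item.question}")
    parts.append("Answer:")
    return "\n\n".join(parts)
```

Calling `build_prompt(item, open_book=False)` and `build_prompt(item, open_book=True)` on the same item gives the two prompts to compare; keeping the passage optional lets one record serve both settings.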

Paper Structure

This paper contains 68 sections, 12 figures, 24 tables.

Figures (12)

  • Figure 1: Model performance across categories for leading open- and closed-source LLMs on ScholarBench. Each column represents a task-specific evaluation metric. Main task-level results are reported in Table \ref{tab:main-result}, while detailed performance analysis by category is provided in Appendix \ref{appx:cate-anal}.
  • Figure 2: Taxonomy of academic categories and question attributes for the English dataset.
  • Figure 3: Data construction pipeline. For a step-by-step example of data construction, see Appendix \ref{appx:data_construction}.
  • Figure 4: Model-wise performance on parallel data across En, Ko, and Both language settings.
  • Figure 5: Frequency distribution of 27 attributes commonly shared across all question types. The balanced distribution without overconcentration on specific attributes suggests that the benchmark enables fair model evaluation across a diverse range of attributes.
  • ...and 7 more figures