Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering

Bowen Jiang, Runchuan Zhu, Jiang Wu, Zinco Jiang, Yifan He, Junyuan Gao, Jia Yu, Rui Min, Yinfan Wang, Haote Yang, Songyang Zhang, Dahua Lin, Lijun Wu, Conghui He

TL;DR

KoLasSimpleQA introduces a multilingual, knowledge-oriented QA benchmark spanning nine languages to evaluate LLM factual ability in two domains: general (global) knowledge and language-specific knowledge. Its construction pipeline selects Wikipedia entries using inter-language links, uses GPT-4o for triple and QA generation, and applies two-stage quality control with LLM-based judging to produce 2,147 high-quality QA pairs. The study reveals pronounced domain disparities: translating questions into English boosts general-domain performance but not language-specific accuracy, and models show calibration and knowledge-memorization gaps on language-specific content. It also analyzes the reasoning processes of Large Reasoning Models, highlighting overhead and bidirectional-memorization differences between domains. Overall, KoLasSimpleQA provides a framework for targeted multilingual evaluation and model optimization, guiding future improvements in multilingual factual capabilities.

Abstract

We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs). Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability. These questions enable efficient evaluation using the LLM-as-judge paradigm, testing both the LLMs' factual memory and self-awareness ("know what they don't know"). KoLasSimpleQA expands existing research in two key dimensions: (1) Breadth (Multilingual Coverage): It includes 9 languages, supporting global applicability evaluation. (2) Depth (Dual Domain Design): It covers both the general domain (global facts) and the language-specific domain (such as history, culture, and regional traditions) for a comprehensive assessment of multilingual capabilities. We evaluated mainstream LLMs, including traditional LLMs and emerging Large Reasoning Models. Results show significant differences between the two domains, particularly in performance metrics, ranking, calibration, and robustness. This highlights the need for targeted evaluation and optimization in multilingual contexts. We hope KoLasSimpleQA will help the research community better identify LLM capability boundaries in multilingual contexts and provide guidance for model optimization. We will release KoLasSimpleQA at https://github.com/opendatalab/KoLasSimpleQA .
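
As a rough illustration of how LLM-as-judge grades might be aggregated into a single score, the sketch below assumes the three-way grading common to SimpleQA-style benchmarks (correct, incorrect, not attempted) and an F-score defined as the harmonic mean of overall accuracy and accuracy on attempted questions. The label strings and the `f_score` helper are hypothetical names chosen for this example; they are not taken from the KoLasSimpleQA release, and the paper's exact metric definitions may differ.

```python
from collections import Counter

# Hypothetical label names; the benchmark's own judge may use different ones.
VALID_GRADES = {"correct", "incorrect", "not_attempted"}


def f_score(grades):
    """Aggregate per-question judge grades into SimpleQA-style metrics.

    Assumption: F-score is the harmonic mean of overall accuracy and
    accuracy restricted to attempted questions.
    """
    counts = Counter(grades)
    unknown = set(counts) - VALID_GRADES
    if unknown:
        raise ValueError(f"unknown grades: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no grades to aggregate")

    correct = counts["correct"]
    attempted = correct + counts["incorrect"]

    overall = correct / total
    # A model that declines every question has no attempted accuracy; use 0.
    given_attempted = correct / attempted if attempted else 0.0

    denom = overall + given_attempted
    f = 2 * overall * given_attempted / denom if denom else 0.0
    return {
        "correct": overall,
        "correct_given_attempted": given_attempted,
        "f_score": f,
    }


if __name__ == "__main__":
    example = ["correct", "correct", "incorrect", "not_attempted"]
    print(f_score(example))
```

Under this definition, declining to answer lowers overall accuracy but not accuracy on attempted questions, so the score rewards a model that abstains when unsure over one that guesses wrongly.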

Paper Structure

This paper contains 26 sections, 13 figures, 26 tables.

Figures (13)

  • Figure 1: Construction pipeline of KoLasSimpleQA. The process includes Wikipedia entry selection based on inter-language links, triple and QA pair generation using GPT-4o, and a two-stage quality control to ensure question quality and diversity.
  • Figure 2: Illustration of inter-language links on a Wikipedia page. The number of such links ($n_{\text{ill}}$) is used to distinguish between language-specific and language-general knowledge.
  • Figure 3: Example QA pairs in KoLasSimpleQA.
  • Figure 4: Example QA pairs in the reverse relationship in KoLasSimpleQA.
  • Figure 5: (a) Model performance (F-score) ranking in general and language-specific domains. The models are sorted based on the general domain (blue line). (b) Differences in F-scores between the tran_en and the direct settings (a value greater than zero indicates that tran_en performs better). (c) Proportion of bidirectional correctness ($P_\text{bi}$) for general and specific domain questions across models.
  • ...and 8 more figures
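
Two quantities named in the captions above can be sketched in a few lines. The first function partitions Wikipedia entries by their inter-language link count ($n_{\text{ill}}$), assuming that few links indicate language-specific knowledge and many links indicate general knowledge. The second computes a proportion of bidirectional correctness, read here as the share of forward/reverse question pairs where both directions are answered correctly. Both function names, the threshold parameters, and this reading of $P_\text{bi}$ are assumptions for illustration; the captions do not give the paper's actual thresholds or formula.

```python
def split_by_inter_language_links(n_ill_by_title, specific_max, general_min):
    """Partition entries by inter-language link count.

    n_ill_by_title: dict mapping entry title -> number of inter-language links.
    specific_max / general_min: hypothetical thresholds; entries falling
    between them are assigned to neither domain.
    """
    if specific_max >= general_min:
        raise ValueError("specific_max must be below general_min")
    specific = [t for t, n in n_ill_by_title.items() if n <= specific_max]
    general = [t for t, n in n_ill_by_title.items() if n >= general_min]
    return specific, general


def bidirectional_proportion(pair_results):
    """Share of (forward_correct, reverse_correct) pairs with both correct.

    Assumed reading of P_bi; the paper may normalise differently.
    """
    if not pair_results:
        return 0.0
    both = sum(1 for forward, reverse in pair_results if forward and reverse)
    return both / len(pair_results)


if __name__ == "__main__":
    # Placeholder counts and thresholds, not values from the paper.
    counts = {"local festival": 1, "Pacific Ocean": 50, "regional dish": 10}
    print(split_by_inter_language_links(counts, specific_max=3, general_min=20))
    print(bidirectional_proportion([(True, True), (True, False)]))
```

Leaving a gap between the two thresholds keeps ambiguous entries out of both domains, which is one simple way to make the two question sets clearly distinct.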