Evaluating Multimodal Generative AI with Korean Educational Standards

Sanghee Park, Geewook Kim

TL;DR

KoNET addresses the lack of Korean multimodal educational benchmarks by converting four national tests into a multimodal VQA dataset that includes human error data for KoCSAT. The study benchmarks a wide range of open- and closed-source LLMs and MLLMs, employing Chain-of-Thought prompts and OCR, and uses an LLM-as-a-Judge framework to standardize evaluation. Key findings show that performance improves with model size, that open-source models lag further behind in Korean contexts, and that linguistic and cultural specificity significantly affects AI performance. By releasing an open-source dataset builder and detailed analyses of human versus AI error patterns, KoNET aims to drive reproducible, language-aware progress in multimodal educational AI and tutoring applications.
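As a rough illustration of the evaluation flow described above, the sketch below aggregates per-exam accuracy from judge verdicts. It is a minimal sketch under stated assumptions, not the KoNET implementation: the record layout, the function names, and the `last_token_judge` stand-in (which replaces the LLM judge with a simple string comparison) are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical record layout: (exam name, model response, gold answer).
Record = Tuple[str, str, str]


def last_token_judge(response: str, gold: str) -> bool:
    """Stand-in for an LLM judge (assumption): compare the last
    whitespace-separated token of the response with the gold answer."""
    tokens = response.split()
    if not tokens:
        return False
    return tokens[-1].strip(".,)") == gold.strip()


def accuracy_by_exam(
    records: Iterable[Record],
    judge: Callable[[str, str], bool] = last_token_judge,
) -> Dict[str, float]:
    """Return the fraction of responses the judge accepts, per exam."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for exam, response, gold in records:
        total[exam] += 1
        correct[exam] += int(judge(response, gold))
    return {exam: correct[exam] / total[exam] for exam in total}
```

In practice the `judge` argument would wrap a call to a judge model; keeping it as a plain callable lets the aggregation logic be tested without any model access.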

Abstract

This paper presents the Korean National Educational Test Benchmark (KoNET), a new benchmark designed to evaluate Multimodal Generative AI Systems using Korean national educational tests. KoNET comprises four exams: the Korean Elementary General Educational Development Test (KoEGED), Middle (KoMGED), High (KoHGED), and College Scholastic Ability Test (KoCSAT). These exams are renowned for their rigorous standards and diverse questions, facilitating a comprehensive analysis of AI performance across different educational levels. By focusing on Korean, KoNET provides insights into model performance in less-explored languages. We assess a range of models (open-source, open-access, and closed APIs) by examining difficulties, subject diversity, and human error rates. The code and dataset builder will be fully open-sourced at https://github.com/naver-ai/KoNET.
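To make the comparison of human and model error rates concrete, the following sketch computes a per-question error rate and the Pearson correlation between two such series. It is only an illustration: the paper describes its own error-rate calculation in an appendix, and the function names and input format here are assumptions.

```python
import math
from typing import Sequence


def error_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of incorrect outcomes (True means a correct answer)."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return 1 - sum(outcomes) / len(outcomes)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

Given per-question human error rates and per-question model error rates in the same order, `pearson(human_rates, model_rates)` yields a value in [-1, 1]; values near 1 would indicate that models tend to fail on the questions humans also find hard.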

Paper Structure

This paper contains 22 sections, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Examples and Performance Overview of KoNET. (a) Illustration of mathematics problem examples, highlighting the increased complexity and difficulty as the educational level progresses. (b) Demonstration of how the accuracy of contemporary AI models decreases with more advanced curricula. A detailed analysis is provided in Section \ref{sec:exp_analysis}.
  • Figure 2: Correlation analysis of error rates. The x-axis shows human error rates, and the y-axis displays error rates from closed-source models. Appendix \ref{appendix:more_analyses_on_human_error_rate} offers a detailed discussion on the methods used to calculate these error rates.
  • Figure 3: Illustrative Representation of the KoNET. The test includes various types of questions, such as those requiring comprehension of images and queries, reading and understanding of lengthy texts, and simple knowledge-based queries.
  • Figure 4: Examples of prompt formats used in the study. These include Direct prompts for answer extraction, CoT (Chain-of-Thought) prompts for reasoning-based inference, and Judge prompts for evaluating the accuracy of generated responses.
  • Figure 5: Performance of LLMs and MLLMs across previous benchmarks and KoNET. The plots compare LLMs and MLLMs on various benchmarks, including KoNET, and show the accuracy distribution for each model type. On KoNET, the distributions of LLMs and MLLMs follow a different trend than on the other benchmarks.
  • ...and 4 more figures