Evaluating Multimodal Generative AI with Korean Educational Standards
Sanghee Park, Geewook Kim
TL;DR
KoNET addresses the lack of Korean multimodal educational benchmarks by converting four national tests into a multimodal VQA dataset that includes human error data for KoCSAT. The study benchmarks a wide range of open- and closed-source LLMs and MLLMs, employing Chain-of-Thought prompts and OCR, and uses an LLM-as-a-Judge framework to standardize evaluation. Key findings show that performance improves with model size, that open-source models show a larger performance gap in Korean contexts, and that linguistic and cultural specificity significantly affects AI performance. By releasing an open-source dataset builder and detailed analyses of human vs. AI error patterns, KoNET aims to drive reproducible, language-aware progress in multimodal educational AI and tutoring applications.
Abstract
This paper presents the Korean National Educational Test Benchmark (KoNET), a new benchmark designed to evaluate multimodal generative AI systems using Korean national educational tests. KoNET comprises four exams: the Korean Elementary, Middle, and High General Educational Development Tests (KoEGED, KoMGED, and KoHGED) and the College Scholastic Ability Test (KoCSAT). These exams are renowned for their rigorous standards and diverse questions, facilitating a comprehensive analysis of AI performance across different educational levels. By focusing on Korean, KoNET provides insights into model performance in less-explored languages. We assess a range of models (open-source, open-access, and closed APIs) by examining question difficulty, subject diversity, and human error rates. The code and dataset builder will be fully open-sourced at https://github.com/naver-ai/KoNET.
