Open Ko-LLM Leaderboard2: Bridging Foundational and Practical Evaluation for Korean LLMs
Hyeonwoo Kim, Dahyun Kim, Jihoo Kim, Sukyung Lee, Yungi Kim, Chanjun Park
TL;DR
Open Ko-LLM Leaderboard2 addresses the gap between Season 1 benchmark scores and real-world model quality by introducing nine Korean-centric benchmarks, including the native KorNAT tasks and the practical Ko-GPQA, Ko-IFEval, Ko-Harmlessness, and Ko-Helpfulness tasks, in place of the Season 1 suite, which consisted largely of benchmarks translated from English. It adopts cost-efficient, GPT-free automated evaluation and a mix of generation- and logit-based tasks, and it uses native Korean data and private datasets to keep assessments fair. Empirical analyses show a shift toward generation-based evaluation, reduced evaluation times, and nuanced cross-task correlations that better reflect real-world utility, although generation tasks align more weakly across seasons. Overall, the framework provides a more meaningful and scalable Korean-language benchmarking platform that emphasizes real-world applicability and safer, more helpful model behavior.
Abstract
The Open Ko-LLM Leaderboard has been instrumental in benchmarking Korean Large Language Models (LLMs), yet it has certain limitations. Notably, there is a disconnect between quantitative improvements on its overly academic benchmarks and the qualitative impact of the models. Furthermore, the benchmark suite consists largely of translations of English benchmarks, which may not fully capture the intricacies of the Korean language. To address these issues, we propose Open Ko-LLM Leaderboard2, an improved version of the earlier Open Ko-LLM Leaderboard. The original benchmarks are entirely replaced with new tasks that are more closely aligned with real-world capabilities. Additionally, four new native Korean benchmarks are introduced to better reflect the distinct characteristics of the Korean language. Through these refinements, Open Ko-LLM Leaderboard2 seeks to provide a more meaningful evaluation for advancing Korean LLMs.
