Open Ko-LLM Leaderboard2: Bridging Foundational and Practical Evaluation for Korean LLMs

Hyeonwoo Kim, Dahyun Kim, Jihoo Kim, Sukyung Lee, Yungi Kim, Chanjun Park

TL;DR

Open Ko-LLM Leaderboard2 addresses the gap in Season 1 between benchmark scores and real-world model quality by introducing nine Korean-centric benchmarks, including native KorNAT tasks and practical Ko-GPQA/Ko-IFEval/Ko-Harmlessness/Ko-Helpfulness, replacing translated English equivalents. It adopts cost-efficient, GPT-free automated evaluation and a mix of generation- and logit-based tasks, with native Korean data and private datasets to ensure fair assessments. Empirical analyses show a shift toward generation-based evaluation, reduced evaluation times, and nuanced cross-task correlations that better reflect real-world utility, despite weaker cross-season alignment for generation tasks. Overall, the framework provides a more meaningful, scalable Korean-language benchmarking platform that emphasizes real-world applicability and safer, more helpful model behavior.

Abstract

The Open Ko-LLM Leaderboard has been instrumental in benchmarking Korean Large Language Models (LLMs), yet it has certain limitations. Notably, the disconnect between quantitative improvements on the overly academic leaderboard benchmarks and the qualitative impact of the models should be addressed. Furthermore, the benchmark suite is largely composed of translated versions of English counterparts, which may not fully capture the intricacies of the Korean language. To address these issues, we propose Open Ko-LLM Leaderboard2, an improved version of the earlier Open Ko-LLM Leaderboard. The original benchmarks are entirely replaced with new tasks that are more closely aligned with real-world capabilities. Additionally, four new native Korean benchmarks are introduced to better reflect the distinct characteristics of the Korean language. Through these refinements, Open Ko-LLM Leaderboard2 seeks to provide a more meaningful evaluation for advancing Korean LLMs.

Paper Structure

This paper contains 19 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Screenshot of the Open Ko-LLM Leaderboard interface showing the current rankings of models evaluated in Season 2. The interface displays model names, overall performance scores, and task-specific results. Users can view detailed evaluation metrics for each model, enabling comparisons based on both quantitative and qualitative performance. This transparent interface encourages healthy competition, fosters continuous improvement, and provides a real-time overview of Korean LLM development progress.
  • Figure 2: Monthly submission trends for Season 1 of the Open Ko-LLM Leaderboard from September 2023 to July 2024.
  • Figure 3: Example answers to the same questions from one of the top-ranking models in Season 1 (left) and in Season 2 (right).
  • Figure 4: Correlation matrices for pre-trained models (left) and fine-tuned models (right) between Season 1 and Season 2 scores.
  • Figure 5: Correlation between the nine new tasks in the Season 2 Open Ko-LLM Leaderboard.