Open Ko-LLM Leaderboard: Evaluating Large Language Models in Korean with Ko-H5 Benchmark

Chanjun Park, Hyeonwoo Kim, Dahyun Kim, Seonghwan Cho, Sanghoon Kim, Sukyung Lee, Yungi Kim, Hwalsuk Lee

TL;DR

The paper introduces the Open Ko-LLM Leaderboard and Ko-H5 Benchmark to standardize and extend Korean LLM evaluation, aligning with the English-led Open LLM Leaderboard and using private test sets to mitigate data leakage. It details a rigorous curation pipeline, including translation and domain-aware review, and demonstrates that Ko-H5 adds linguistic diversity through Ko-CommonGen v2 while maintaining low overlap with training data. Temporal and size-type analyses reveal stepwise performance gains, a strong link between pretraining and instruction-tuning, and task-specific saturation dynamics, motivating expansion beyond fixed benchmarks. The work emphasizes community involvement and evolving benchmark practices to better reflect real-world use cases and linguistic diversity in Korean NLP applications.

Abstract

This paper introduces the Open Ko-LLM Leaderboard and the Ko-H5 Benchmark as vital tools for evaluating Large Language Models (LLMs) in Korean. Incorporating private test sets while mirroring the English Open LLM Leaderboard, we establish a robust evaluation framework that has been well integrated into the Korean LLM community. We perform a data leakage analysis that shows the benefit of private test sets, along with a correlation study within the Ko-H5 benchmark and temporal analyses of the Ko-H5 score. Moreover, we present empirical support for the need to expand beyond set benchmarks. We hope the Open Ko-LLM Leaderboard sets a precedent for expanding LLM evaluation to foster more linguistic diversity.

Paper Structure (31 sections, 13 figures, 6 tables)

Figures (13)

  • Figure 1: Data curation process for the Ko-H5 benchmark. Human reviewers who are culturally aligned with the Korean language thoroughly review the machine translation results. Additionally, we filter out data that require specific domain knowledge and re-translate them with translators trained in the required domain.
  • Figure 2: Correlations between the different tasks in the Ko-H5 benchmark, shown as a heatmap with values ranging from $-1$ to $1$. In general, Ko-TruthfulQA and Ko-CommonGen v2 correlate less with the other tasks.
  • Figure 3: Correlations between the different tasks in the Ko-H5 benchmark for different model size brackets. The overall trend changes noticeably as model size increases. Specifically, Ko-TruthfulQA and Ko-CommonGen v2 show low, or sometimes negative, correlations with other tasks for smaller models, whereas bigger models show higher correlations.
  • Figure 4: Ko-H5 score over time for different model sizes. Time ticks are spaced two weeks apart. Scores for the zero-to-three-billion bracket are considerably lower than those of the other two brackets.
  • Figure 5: Ko-H5 score over time for different model types. Time ticks are spaced two weeks apart. The performance trend of the instruction-tuned models follows that of the pretrained models.
  • ...and 8 more figures
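
The task-correlation heatmaps described in Figures 2 and 3 can be illustrated with a small sketch. It assumes the correlation is a Pearson coefficient computed over per-model task scores; the summary above does not state which coefficient the paper uses. The scores are made-up placeholders, not leaderboard results, and `hypothetical_task`, `pearson`, and `correlation_matrix` are illustrative names that do not come from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def correlation_matrix(scores):
    """Map each task pair to its correlation; values lie in [-1, 1]."""
    tasks = list(scores)
    return {a: {b: pearson(scores[a], scores[b]) for b in tasks} for a in tasks}


# Placeholder scores: one entry per (hypothetical) model, in the same order
# for every task. These numbers are invented for illustration only.
scores = {
    "Ko-TruthfulQA": [40.0, 42.0, 41.0, 45.0],
    "Ko-CommonGen v2": [30.0, 35.0, 50.0, 52.0],
    "hypothetical_task": [20.0, 25.0, 40.0, 42.0],
}

matrix = correlation_matrix(scores)
for task, row in matrix.items():
    print(task, {other: round(value, 2) for other, value in row.items()})
```

Restricting `scores` to the models in one size bracket before calling `correlation_matrix` would give a per-bracket view along the lines of Figure 3.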