Understanding LLM Development Through Longitudinal Study: Insights from the Open Ko-LLM Leaderboard

Chanjun Park, Hyeonwoo Kim

TL;DR

The paper tackles the problem of limited observation periods in LLM evaluation by conducting an eleven-month longitudinal study of Korean LLMs on the Open Ko-LLM Leaderboard. It analyzes 1,769 models across five tasks (Ko-H5) to characterize performance trajectories, task correlations, and leaderboard dynamics. Key findings show rapid gains and early saturation on some tasks, stronger cross-task correlations in larger models, and a pretrained-model–driven progress bottleneck that shapes long-term improvements. The work demonstrates the value of extended leaderboard data for guiding targeted research in Korean LLM development and evaluation, with implications for benchmark design and scaling strategies.

Abstract

This paper conducts a longitudinal study over eleven months to address the limitations of prior research on the Open Ko-LLM Leaderboard, which has relied on empirical studies with restricted observation periods of only five months. By extending the analysis duration, we aim to provide a more comprehensive understanding of the progression in developing Korean large language models (LLMs). Our study is guided by three primary research questions: (1) What are the specific challenges in improving LLM performance across diverse tasks on the Open Ko-LLM Leaderboard over time? (2) How does model size impact task performance correlations across various benchmarks? (3) How have the patterns in leaderboard rankings shifted over time on the Open Ko-LLM Leaderboard? By analyzing 1,769 models over this period, our research offers a comprehensive examination of the ongoing advancements in LLMs and the evolving nature of evaluation frameworks.

Paper Structure (11 sections, 6 figures, 2 tables)

This paper contains 11 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Performance trends of LLMs across different tasks on the Open Ko-LLM Leaderboard over an eleven-month period. The total number of submitted models is 1,769.
  • Figure 2: Correlation between task performances across different model size categories, illustrating how task correlations change with increasing model size.
  • Figure 3: Analysis of Task Correlations Over Time.
  • Figure 4: Performance Trends Over Time for Different Model Types.
  • Figure 5: Performance Trends by Model Size.
  • ...and 1 more figure
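
The kind of analysis summarized in Figure 2 (task correlations per model size category) can be illustrated with a minimal sketch. This is not the authors' code: the record layout, the `size_bucket` field, the task names `task_a` and `task_b`, and the use of Pearson correlation are all assumptions made for illustration, since the summary does not state how the correlations were computed.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def task_correlations_by_size(rows, tasks, size_key="size_bucket"):
    """Group model records by size bucket, then correlate every task pair.

    `rows` is a list of dicts, one per model (hypothetical layout).
    Returns {bucket: {(task_1, task_2): correlation}}.
    """
    groups = {}
    for row in rows:
        groups.setdefault(row[size_key], []).append(row)

    result = {}
    for bucket, members in groups.items():
        result[bucket] = {
            (a, b): pearson([m[a] for m in members], [m[b] for m in members])
            for a, b in combinations(tasks, 2)
        }
    return result
```

Called with one record per submitted model and the list of benchmark task names, the function returns one correlation per task pair within each size bucket, which is the quantity a heatmap per size category would display.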