GECKO: Generative Language Model for English, Code and Korean

Sungwoo Oh, Donggyu Kim

TL;DR

GECKO addresses the need for strong Korean-English bilingual proficiency with programming capabilities by training from scratch on a balanced Korean-English-code corpus, using a decoder-only Transformer from the LLaMA family. The authors develop a data-centric pipeline featuring balanced sampling, high-quality Korean data collection, and multilingual alignment, paired with a BPE tokenizer with a 32k vocabulary and a training setup that incorporates rotary embeddings and handles contexts of up to 8192 tokens. In the reported evaluation, GECKO achieves leading performance on KMMLU (Korean MMLU) among open-source models and reasonable results on English and code benchmarks, and it is released as open source to accelerate Korean LLM research. The work offers a practical baseline and insights into data processing, tokenizer design, and scalable pretraining for Korean-centric LLMs; planned future work includes instruction tuning and expanded resources.

Abstract

We introduce GECKO, a bilingual large language model (LLM) optimized for Korean and English, along with programming languages. GECKO is pretrained on a balanced, high-quality corpus of Korean and English using the LLaMA architecture. In this report, we share our experience from several efforts to build a better data pipeline for the corpus and to train our model. GECKO generates tokens efficiently for both Korean and English despite its small vocabulary size. We measure its performance on representative benchmarks for Korean, English, and code; it performs strongly on KMMLU (Korean MMLU) and modestly on English and code, even though it was trained on fewer tokens than English-focused LLMs. GECKO is available to the open-source community under a permissive license. We hope our work offers a research baseline and practical insights for Korean LLM research. The model can be found at: https://huggingface.co/kifai/GECKO-7B

Paper Structure (14 sections, 1 equation, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Distribution of pretraining data sources for bilingual language models. The left pie chart illustrates the proportional composition of the corpus by language, highlighting a balanced representation of 35% Korean, 28% English, and 37% code to accommodate low-resource language challenges. The right pie chart details the types of data utilized, with 36% web sources, 24% from Wikipedia, 16% from news articles, 16% from books, 5% from patents, and 3% from translated texts. This distribution supports efforts to enhance model performance by diversifying and balancing the training data across different types and languages.
  • Figure 2: Pipeline for cleansing the corpus
  • Figure 3: Example of normalization for a wiki dataset: The left image displays the original data, while the right image shows the preprocessed and normalized data in markdown format.
  • Figure 4: Comparative analysis of tokenizer efficiency across multiple language models. This graph illustrates the performance of various tokenizers, including GECKO, Polyglot-Ko, LLaMA-2, Mistral, and GPT-4, across Korean, English, and code text corpora. The y-axis represents token efficiency as a percentage, with higher values indicating superior encoding performance relative to the tokenizer of GECKO. This analysis highlights the varying efficiency levels each model exhibits, offering insights into how effectively each tokenizer encodes multilingual and coding data. The dashed red line at 100% serves as a benchmark for baseline efficiency.
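
The Figure 4 caption describes token efficiency as a percentage relative to GECKO's tokenizer, with 100% as the baseline. The caption does not give the exact formula, so the following is only a minimal sketch under an assumed definition: efficiency is the baseline tokenizer's token count divided by the candidate tokenizer's token count on the same text, times 100, so a candidate that needs fewer tokens scores above 100%. The function name `relative_token_efficiency` and the two toy tokenizers are hypothetical stand-ins and are not the tokenizers compared in the paper.

```python
from typing import Callable, List, Sequence

# A tokenizer is modeled as any function mapping text to a list of tokens.
Tokenizer = Callable[[str], List[str]]


def relative_token_efficiency(
    texts: Sequence[str], baseline: Tokenizer, candidate: Tokenizer
) -> float:
    """Return candidate efficiency as a percentage of the baseline.

    Assumed definition (not taken from the paper):
        100 * (baseline token count) / (candidate token count)
    so 100.0 means parity and values above 100.0 mean fewer tokens.
    """
    baseline_tokens = sum(len(baseline(text)) for text in texts)
    candidate_tokens = sum(len(candidate(text)) for text in texts)
    if candidate_tokens == 0:
        raise ValueError("candidate tokenizer produced no tokens")
    return 100.0 * baseline_tokens / candidate_tokens


def whitespace_tokenizer(text: str) -> List[str]:
    """Toy tokenizer: split on whitespace."""
    return text.split()


def character_tokenizer(text: str) -> List[str]:
    """Toy tokenizer: one token per non-whitespace character."""
    return [ch for ch in text if not ch.isspace()]


# Tiny illustrative corpus mixing code and Korean text.
corpus = ["def add(a, b): return a + b", "안녕하세요 세계"]

parity = relative_token_efficiency(corpus, whitespace_tokenizer, whitespace_tokenizer)
char_level = relative_token_efficiency(corpus, whitespace_tokenizer, character_tokenizer)
print(f"same tokenizer: {parity:.1f}%")
print(f"character-level vs. whitespace baseline: {char_level:.1f}%")
```

To compare real tokenizers, each callable would wrap a trained tokenizer's encode step, and the corpus would be split into Korean, English, and code subsets so that each can be scored separately, as in the figure.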