GECKO: Generative Language Model for English, Code and Korean
Sungwoo Oh, Donggyu Kim
TL;DR
GECKO addresses the need for strong Korean-English bilingual proficiency combined with programming capability by training from scratch on a balanced Korean-English-Code corpus, using a decoder-only Transformer in the LLaMA family. The authors develop a data-centric pipeline featuring balanced sampling, high-quality Korean data collection, and multilingual alignment, paired with a BPE tokenizer with a 32k vocabulary and a training setup that incorporates rotary embeddings and long-context handling up to 8192 tokens. Empirical evaluation shows GECKO achieves leading performance on KMMLU (Korean MMLU) among open-source models and reasonable results on English and code benchmarks, while remaining open-source to accelerate Korean LLM research. The work offers a practical baseline and insights into data processing, tokenizer design, and scalable pretraining for Korean-centric LLMs, with future work including instruction tuning and expanded resources.
Abstract
We introduce GECKO, a bilingual large language model (LLM) optimized for Korean and English, along with programming languages. GECKO is pretrained on a balanced, high-quality corpus of Korean and English using the LLaMA architecture. In this report, we share our experience building a better data pipeline for the corpus and training the model. GECKO generates tokens efficiently for both Korean and English despite its small vocabulary. We measure performance on representative benchmarks for Korean, English, and code: GECKO performs strongly on KMMLU (Korean MMLU) and modestly on English and code, even though it was trained on fewer tokens than English-focused LLMs. GECKO is available to the open-source community under a permissive license. We hope our work offers a research baseline and practical insights for Korean LLM research. The model can be found at: https://huggingface.co/kifai/GECKO-7B
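The abstract does not define how token efficiency is measured. As a minimal sketch, one common proxy is the average number of tokens produced per character of input, where a lower value means fewer tokens are needed for the same text. The function name `tokens_per_character` and the two toy tokenizers below are hypothetical stand-ins for illustration; they are not GECKO's BPE tokenizer, and the sample strings are not from the paper's corpus.

```python
from typing import Callable, Iterable, List


def tokens_per_character(
    tokenize: Callable[[str], List[str]], texts: Iterable[str]
) -> float:
    """Average tokens emitted per input character (lower = more efficient).

    `tokenize` is any function mapping a string to a list of tokens.
    This metric is an assumption, not the paper's stated definition.
    """
    total_tokens = 0
    total_chars = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_chars += len(text)
    if total_chars == 0:
        raise ValueError("no characters to measure")
    return total_tokens / total_chars


# Toy stand-in tokenizers (hypothetical, for illustration only).
def char_tokenizer(text: str) -> List[str]:
    # One token per character: the least efficient baseline.
    return list(text)


def whitespace_tokenizer(text: str) -> List[str]:
    # One token per whitespace-separated word.
    return text.split()


samples = ["안녕하세요 세계", "hello world"]
char_rate = tokens_per_character(char_tokenizer, samples)
word_rate = tokens_per_character(whitespace_tokenizer, samples)
```

To compare real tokenizers, one could pass each tokenizer's own tokenize function in place of the toy ones and evaluate on the same held-out Korean and English text.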
