KORMo: Korean Open Reasoning Model for Everyone

Minjun Kim, Hyeonseok Lim, Hangyeol Yoo, Inho Won, Seungwoo Song, Minkyung Cho, Junhun Yuk, Changsu Choi, Dongjae Shin, Huige Lee, Hoyun Song, Alice Oh, Kyungtae Lim

TL;DR

KORMo presents the first fully open bilingual Korean–English LLM trained largely on synthetic data, achieving performance competitive with open-weight multilingual baselines. Through careful design choices, including Pre-LN normalization, intra-document masking, and a bilingual tokenization strategy, the authors demonstrate stable pretraining and robust multilingual generalization. A two-stage pretraining curriculum, explicit long-context and reasoning mid-training, and thorough instruction-following SFT yield strong Korean instruction following and reasoning coherence while preserving solid English capabilities. By releasing data, code, and logs, KORMo promotes reproducibility and provides a concrete framework for synthetic-data-driven open models in low-resource languages, with future work focused on reasoning-oriented RL and broader multilingual expansion.

Abstract

This work presents the first large-scale investigation into constructing a fully open bilingual large language model (LLM) for a non-English language, specifically Korean, trained predominantly on synthetic data. We introduce KORMo-10B, a 10.8B-parameter model trained from scratch on a Korean-English corpus in which 68.74% of the Korean portion is synthetic. Through systematic experimentation, we demonstrate that synthetic data, when carefully curated with balanced linguistic coverage and diverse instruction styles, does not cause instability or degradation during large-scale pretraining. Furthermore, the model achieves performance comparable to that of contemporary open-weight multilingual baselines across a wide range of reasoning, knowledge, and instruction-following benchmarks. Our experiments reveal two key findings: (1) synthetic data can reliably sustain long-horizon pretraining without model collapse, and (2) bilingual instruction tuning enables near-native reasoning and discourse coherence in Korean. By fully releasing all components including data, code, training recipes, and logs, this work establishes a transparent framework for developing synthetic data-driven fully open models (FOMs) in low-resource settings and sets a reproducible precedent for future multilingual LLM research.

Paper Structure

This paper contains 70 sections, 11 figures, 22 tables, 1 algorithm.

Figures (11)

  • Figure 1: Compression trends by data ratio in the English setting. The x-axis represents the synthetic–crawl ratio (left: synthetic-dominant, right: crawl-dominant), and the y-axis shows compression efficiency measured in bytes per token (BPT), where higher values indicate greater efficiency.
  • Figure 2: Compression trends by data ratio in the Korean setting. The x-axis represents the synthetic–crawl ratio (left: synthetic-dominant, right: crawl-dominant), and the y-axis shows compression efficiency measured in bytes per token (BPT), where higher values indicate greater efficiency.
  • Figure 3: Comparison of English and Korean compression performance between the tokenizer candidates defined in Table \ref{tab:candidate_tokenizer} and commercial tokenizers (GPT-4, LLaMA).
  • Figure 4: Proportion of Korean tokens within tokenizer vocabularies across different models. Each bar represents the share of Korean (KR) versus non-Korean (Other) tokens. While English-centric models such as LLaMA and GPT exhibit minimal Korean coverage (1.8% and 0.3%, respectively), Korean-specialized models (Exaone4, HyperCLOVAX, A.X-4.0, Midm and ours) show significantly higher proportions.
  • Figure 5: Distribution of quality scores assigned to 400K Korean community-OSCAR samples using Qwen3-32B with translated FineWeb-Edu scoring prompts. The majority of samples were rated as 0, indicating low or no educational value.
  • ...and 6 more figures
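
The bytes-per-token (BPT) measure named in the Figure 1 and Figure 2 captions can be illustrated with a minimal sketch. It assumes BPT is the total UTF-8 byte count of a corpus divided by the total number of tokens the tokenizer emits, which matches the caption's statement that higher values indicate greater efficiency. The function names and the two toy tokenizers are hypothetical stand-ins for illustration, not the authors' tokenizers or evaluation code.

```python
from typing import Callable, Iterable, List


def bytes_per_token(texts: Iterable[str],
                    tokenize: Callable[[str], List[str]]) -> float:
    """Average number of UTF-8 bytes covered by one token.

    Higher values mean each token encodes more raw text, i.e. the
    tokenizer compresses the corpus more efficiently.
    """
    total_bytes = 0
    total_tokens = 0
    for text in texts:
        total_bytes += len(text.encode("utf-8"))
        total_tokens += len(tokenize(text))
    if total_tokens == 0:
        raise ValueError("tokenizer produced no tokens")
    return total_bytes / total_tokens


def whitespace_tokenize(text: str) -> List[str]:
    # Hypothetical toy tokenizer: one token per whitespace-separated word.
    return text.split()


def char_tokenize(text: str) -> List[str]:
    # Hypothetical toy tokenizer: one token per Unicode character.
    return list(text)


if __name__ == "__main__":
    english = ["open models for everyone"]
    korean = ["한국어 언어 모델"]
    print("EN, word-level:", bytes_per_token(english, whitespace_tokenize))
    print("KO, word-level:", bytes_per_token(korean, whitespace_tokenize))
    print("KO, char-level:", bytes_per_token(korean, char_tokenize))
```

Because each Hangul syllable occupies three bytes in UTF-8, a tokenizer that falls back to single characters or bytes on Korean text scores a much lower BPT than one with dedicated Korean vocabulary entries, which is the kind of gap such a comparison is meant to surface.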