Kanana: Compute-efficient Bilingual Language Models

Kanana LLM Team, Yunju Bak, Hojin Lee, Minho Ryu, Jiyeon Ham, Seungjae Jung, Daniel Wontae Nam, Taegyeong Eo, Donghun Lee, Doohae Jung, Boseop Kim, Nayeon Kim, Jaesun Park, Hyunho Kim, Hyunwoong Ko, Changmin Lee, Kyoung-Woon On, Seulye Baeg, Junrae Cho, Sunghee Jung, Jieun Kang, EungGyun Kim, Eunhwa Kim, Byeongil Ko, Daniel Lee, Minchul Lee, Miok Lee, Shinbok Lee, Gaeun Seo

TL;DR

Kanana addresses the high compute costs of large language models by presenting a compute-efficient bilingual LLM family that excels in Korean while remaining competitive in English. The approach combines data-efficient pre-training on a 3 trillion-token bilingual corpus with staged pre-training, depth up-scaling, and iterative pruning/distillation, followed by instruction-focused post-training using supervised fine-tuning and preference optimization. The work also demonstrates practical adaptations, including embedding backbones, retrieval-augmented generation, and function calling in Korean, achieving strong benchmarks at smaller scales and highlighting a viable path toward accessible, multilingual LLM development. Overall, Kanana offers a scalable framework that reduces training costs, improves Korean capabilities, and enables versatile downstream applications with potential impact on research and industry.

Abstract

We introduce Kanana, a series of bilingual language models that demonstrate excellent performance in Korean and competitive performance in English. The computational cost of Kanana is significantly lower than that of state-of-the-art models of similar size. The report details the techniques employed during pre-training to achieve compute-efficient yet competitive models, including high-quality data filtering, staged pre-training, depth up-scaling, and pruning and distillation. Furthermore, the report outlines the methodologies used during the post-training of the Kanana models, encompassing supervised fine-tuning and preference optimization, aimed at enhancing their capability for seamless interaction with users. Lastly, the report elaborates on plausible approaches to adapting language models to specific scenarios, such as embedding, retrieval-augmented generation, and function calling. The Kanana model series spans from 2.1B to 32.5B parameters, with the 2.1B models (base, instruct, embedding) publicly released to promote research on Korean language models.

Paper Structure

This paper contains 33 sections, 14 figures, 19 tables.

Figures (14)

  • Figure 1: Performance versus pre-training computational cost for Kanana and comparable models. We measure computational cost in FLOPs (floating-point operations), approximated as 6 $\times$ training tokens $\times$ model size (Kaplan et al., 2020). For distillation models, we count only the student's training FLOPs. Kanana models achieve competitive performance given their limited computational cost.
  • Figure 2: Kanana's staged pre-training data mixture.
  • Figure 3: Data size and proportion of each domain.
  • Figure 4: Kanana model performance at each stage of training across different model sizes. The y-axis is the average of the normalized scores of all benchmarks in \ref{table:chat-eval-2} and \ref{table:chat-eval-1}. Each score is normalized by dividing it by the maximum possible score.
  • Figure 5: Performance comparison of various models based on averaged helpfulness and grounding in RAG-General-Bench.
  • ...and 9 more figures
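
The cost estimate in the Figure 1 caption (6 $\times$ training tokens $\times$ model size) and the score normalization in the Figure 4 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function names, the benchmark names, and every number in the usage lines are hypothetical values chosen for the example, not figures reported in the paper.

```python
import math


def approx_training_flops(num_params: float, num_tokens: float) -> float:
    """Approximate training compute as 6 * tokens * parameters (Figure 1 caption)."""
    return 6.0 * num_tokens * num_params


def mean_normalized_score(scores: dict, max_scores: dict) -> float:
    """Average of per-benchmark scores, each divided by its maximum possible score
    (the normalization described in the Figure 4 caption)."""
    if not scores:
        raise ValueError("at least one benchmark score is required")
    normalized = [scores[name] / max_scores[name] for name in scores]
    return sum(normalized) / len(normalized)


# Hypothetical inputs for illustration only (not values from the paper):
# a 2.1e9-parameter model trained on 3e12 tokens.
flops = approx_training_flops(num_params=2.1e9, num_tokens=3e12)
print(f"approximate training FLOPs: {flops:.3e}")

# Two made-up benchmarks on different scales (0-100 and 0-10).
example_scores = {"bench_a": 50.0, "bench_b": 8.0}
example_max = {"bench_a": 100.0, "bench_b": 10.0}
print(f"mean normalized score: {mean_normalized_score(example_scores, example_max):.2f}")
```

Dividing by the maximum possible score puts benchmarks with different scales on a common 0 to 1 range before averaging, so no single benchmark dominates the mean. Per the Figure 1 caption, a distilled model would pass only the student's parameter count and training tokens to the cost function.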