Kanana: Compute-efficient Bilingual Language Models
Kanana LLM Team, Yunju Bak, Hojin Lee, Minho Ryu, Jiyeon Ham, Seungjae Jung, Daniel Wontae Nam, Taegyeong Eo, Donghun Lee, Doohae Jung, Boseop Kim, Nayeon Kim, Jaesun Park, Hyunho Kim, Hyunwoong Ko, Changmin Lee, Kyoung-Woon On, Seulye Baeg, Junrae Cho, Sunghee Jung, Jieun Kang, EungGyun Kim, Eunhwa Kim, Byeongil Ko, Daniel Lee, Minchul Lee, Miok Lee, Shinbok Lee, Gaeun Seo
TL;DR
Kanana addresses the high compute cost of large language models with a compute-efficient bilingual LLM family that excels in Korean while remaining competitive in English. The approach combines data-efficient pre-training on a 3-trillion-token bilingual corpus with staged pre-training, depth up-scaling, and iterative pruning and distillation, followed by post-training with supervised fine-tuning and preference optimization. The work also demonstrates practical adaptations, including embedding models, retrieval-augmented generation, and function calling in Korean, and reports strong benchmark results at smaller scales. Overall, Kanana offers a scalable framework that reduces training cost, improves Korean capabilities, and supports versatile downstream applications in research and industry.
Abstract
We introduce Kanana, a series of bilingual language models that demonstrate excellent performance in Korean and competitive performance in English. The computational cost of Kanana is significantly lower than that of state-of-the-art models of similar size. The report details the techniques employed during pre-training to achieve compute-efficient yet competitive models, including high-quality data filtering, staged pre-training, depth up-scaling, and pruning and distillation. Furthermore, the report outlines the methods used during post-training of the Kanana models, namely supervised fine-tuning and preference optimization, which aim to improve their ability to interact seamlessly with users. Lastly, the report describes approaches for adapting the language models to specific scenarios, such as embedding, retrieval-augmented generation, and function calling. The Kanana model series spans from 2.1B to 32.5B parameters, with the 2.1B models (base, instruct, embedding) publicly released to promote research on Korean language models.
