A Technical Report for Polyglot-Ko: Open-Source Large-Scale Korean Language Models

Hyunwoong Ko, Kichang Yang, Minho Ryu, Taekyoon Choi, Seungmu Yang, Jiwung Hyun, Sungho Park, Kyubyong Park

TL;DR

The paper addresses the non-English performance gap in multilingual LMs by introducing Polyglot-Ko, a family of Korean-focused large language models trained on a large, carefully preprocessed Korean corpus. It describes four model sizes (1.3B, 3.8B, 5.8B, 12.8B) trained with GPT-NeoX, using a morpheme-aware Byte-Pair Encoding tokenizer with MeCab for Korean morphological analysis, and evaluates them on the KOBEST benchmark tasks COPA, HellaSwag, BoolQ, SentiNeg, and WiC. The 12.8B model generally performs best across zero-shot and few-shot settings, though some baselines excel on specific tasks and WiC remains challenging, with scores near chance (Figure 3); prompt design also significantly affects SentiNeg results. Limitations include data preprocessing issues, potential generation of harmful content, and hardware constraints. The roadmap is to scale to 40B parameters and to build East-Asian and Romance multilingual variants, broadening accessibility and impact in Korean NLP and beyond.

Abstract

Polyglot is a pioneering project aimed at enhancing the non-English language performance of multilingual language models. Despite the availability of various multilingual models such as mBERT (Devlin et al., 2019), XGLM (Lin et al., 2022), and BLOOM (Scao et al., 2022), researchers and developers often resort to building monolingual models in their respective languages because they are dissatisfied with the non-English capabilities of current multilingual models. To address this gap, we seek to develop advanced multilingual language models that offer improved performance in non-English languages. In this paper, we introduce the Polyglot Korean models, which focus on a single language rather than being multilingual. In collaboration with TUNiB, our team collected 1.2TB of meticulously curated Korean data for this research. We made a deliberate decision to prioritize the development of Korean models before venturing into multilingual models. This choice was motivated by multiple factors: the Korean models facilitated performance comparisons with existing multilingual models, and they catered to the specific needs of Korean companies and researchers. This paper presents our work in developing the Polyglot Korean models, which take some steps towards addressing the non-English language performance gap in multilingual language models.

Paper Structure

This paper contains 17 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: The performance metrics of COPA (top left), HellaSwag (top right), SentiNeg (bottom left), and BoolQ (bottom right) tasks using the KOBEST dataset, with all metrics measured using the F1 score.
  • Figure 2: The 5-shot performance of Polyglot-Ko models on each task shows a clear trend: performance improves as compute increases.
  • Figure 3: The performance metrics of the SentiNeg task with a modified prompt (left) and the WiC task (right) using the KOBEST dataset. SentiNeg is reported as F1 score, while WiC is reported as accuracy to show that performance is at chance level.
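
The captions above report F1 for most tasks and accuracy for WiC. As a minimal sketch of how the two metrics differ, the Python below computes both on toy labels. It assumes macro-averaged F1, which the captions do not specify, and the labels are hypothetical illustration data, not results from the paper. The helper names `macro_f1` and `accuracy` are also illustrative.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: the unweighted mean of per-class F1 scores.

    Assumption: macro averaging. The figure captions only say "F1 score".
    """
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical balanced binary labels (not data from the paper).
gold = [1, 1, 1, 0, 0, 0, 1, 0]
# A degenerate predictor that always answers with class 1.
always_one = [1] * len(gold)

print("accuracy:", accuracy(gold, always_one))
print("macro F1:", macro_f1(gold, always_one))
```

On balanced binary data, a predictor that always outputs one class scores 0.5 accuracy but a lower macro F1, because the class it never predicts contributes an F1 of zero. This is one possible reading of why accuracy is the more direct way to display chance-level behavior on a binary task such as WiC.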