KIT-19: A Comprehensive Korean Instruction Toolkit on 19 Tasks for Fine-Tuning Korean Large Language Models

Dongjun Jang, Sungjoo Byun, Hyemi Jo, Hyopil Shin

TL;DR

KIT-19 is a dataset created in an instruction format, comprising 19 existing open-source datasets for Korean NLP tasks, and has the potential to make a substantial contribution to the future improvement of Korean LLMs’ performance.

Abstract

Instruction tuning is an essential step for a large language model (LLM) to function well and achieve high performance on specific tasks. Accordingly, for mainstream languages such as English, instruction-based datasets are being constructed and made publicly available. For Korean, publicly available models and datasets all rely on ChatGPT output or on translations of datasets built in English. In this paper, we introduce KIT-19, an instruction dataset for developing Korean LLMs. KIT-19 is created in an instruction format and comprises 19 existing open-source datasets for Korean NLP tasks. We train a Korean pretrained LLM on KIT-19 to demonstrate its effectiveness. The experimental results show that the model trained on KIT-19 significantly outperforms existing Korean LLMs. Based on its quality and these empirical results, we propose that KIT-19 has the potential to contribute substantially to future improvements in the performance of Korean LLMs.

Paper Structure (28 sections, 3 figures, 6 tables)


Figures (3)

  • Figure 1: A glance at the KIT for Korean LLM: We create instruction datasets by drawing from 19 Korean NLP datasets across 10 different categories. We utilize 'kowiki_text' as a source dataset for both Closed Book QA and Next Sentence Prediction tasks.
  • Figure 2: Overview of the data construction procedure of KIT-19
  • Figure 3: Instruction template used to construct KIT-19. The example shown is from one of the datasets used for the STS task. We employ 10 unique templates for each dataset, resulting in a total of 200 templates.
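
The template-based construction described in the Figure 3 caption can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the template wording, the field names (sentence1, sentence2, label), the output keys, and the helper names are hypothetical and are not taken from KIT-19 itself. Only two templates are shown, whereas the caption states that 10 are used per dataset.

```python
import random

# Hypothetical STS templates (illustrative wording, not the KIT-19 templates).
STS_TEMPLATES = [
    "Rate the semantic similarity of the two sentences.\n"
    "Sentence 1: {sentence1}\nSentence 2: {sentence2}",
    "How similar in meaning are these sentences?\n"
    "A: {sentence1}\nB: {sentence2}",
]


def to_instruction(record, templates, rng):
    """Turn one labeled record into an instruction/output pair.

    One template is sampled per record so that the same task is
    phrased in several different ways across the dataset.
    """
    template = rng.choice(templates)
    return {
        "instruction": template.format(
            sentence1=record["sentence1"],
            sentence2=record["sentence2"],
        ),
        "output": str(record["label"]),
    }


def build_instruction_set(records, templates, seed=0):
    """Convert a list of records, using a seeded RNG for reproducibility."""
    rng = random.Random(seed)
    return [to_instruction(r, templates, rng) for r in records]


# Hypothetical source records standing in for an existing STS dataset.
records = [
    {"sentence1": "A man is playing a guitar.",
     "sentence2": "Someone plays an instrument.",
     "label": 4.2},
    {"sentence1": "The train leaves at noon.",
     "sentence2": "I had soup for lunch.",
     "label": 0.4},
]

examples = build_instruction_set(records, STS_TEMPLATES, seed=0)
```

Sampling one template per record is only one possible design; each record could equally be expanded with every template, at the cost of a proportionally larger dataset.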