KLUE: Korean Language Understanding Evaluation

Sungjoon Park, Jihyung Moon, Sungdong Kim, Won Ik Cho, Jiyoon Han, Jangwon Park, Chisung Song, Junseong Kim, Yongsook Song, Taehwan Oh, Joohong Lee, Juhyun Oh, Sungwon Lyu, Younghoon Jeong, Inkwon Lee, Sangwoo Seo, Dongjun Lee, Hyunwoo Kim, Myeonghwa Lee, Seongbo Jang, Seungwon Do, Sunkyoung Kim, Kyungtae Lim, Jongwon Lee, Kyumin Park, Jamin Shin, Seonghyun Kim, Lucy Park, Alice Oh, Jung-Woo Ha, Kyunghyun Cho

TL;DR

KLUE is a comprehensive Korean Language Understanding Evaluation benchmark built from the ground up to avoid translation artifacts and licensing restrictions. It spans eight NLU tasks (Topic Classification, Semantic Textual Similarity, Natural Language Inference, Named Entity Recognition, Relation Extraction, Dependency Parsing, Machine Reading Comprehension, and Dialogue State Tracking) drawn from ten diverse Korean corpora with careful preprocessing, ethical filtering, and annotation guidelines. The authors release strong baselines based on Korean-language pretrained models (KLUE-BERT and KLUE-RoBERTa) and provide detailed task-specific fine-tuning configurations. Their experiments show that Korean-specific PLMs outperform multilingual counterparts, that larger models generally yield higher accuracy, and that morpheme-based tokenization helps in morpheme-level tasks. They also show that replacing PII in the pretraining corpus has minimal impact on downstream performance, and they emphasize open-access licensing to accelerate future research and enable reproducibility. Overall, KLUE serves as a standard framework for advancing Korean NLP, offering reproducible benchmarks, data processing protocols, and pretrained models for broader research and cross-lingual study.

Abstract

We introduce the Korean Language Understanding Evaluation (KLUE) benchmark. KLUE is a collection of 8 Korean natural language understanding (NLU) tasks, including Topic Classification, Semantic Textual Similarity, Natural Language Inference, Named Entity Recognition, Relation Extraction, Dependency Parsing, Machine Reading Comprehension, and Dialogue State Tracking. We build all of the tasks from scratch from diverse source corpora while respecting copyrights, to ensure accessibility for anyone without any restrictions. With ethical considerations in mind, we carefully design annotation protocols. Along with the benchmark tasks and data, we provide suitable evaluation metrics and fine-tuning recipes for pretrained language models for each task. We furthermore release the pretrained language models (PLMs), KLUE-BERT and KLUE-RoBERTa, to help reproduce baseline models on KLUE and thereby facilitate future research. We make a few interesting observations from the preliminary experiments using the proposed KLUE benchmark suite, already demonstrating the usefulness of this new benchmark suite. First, we find KLUE-RoBERTa-large outperforms other baselines, including multilingual PLMs and existing open-source Korean PLMs. Second, we see minimal degradation in performance even when we replace personally identifiable information in the pretraining corpus, suggesting that privacy and NLU capability are not at odds with each other. Lastly, we find that using BPE tokenization in combination with morpheme-level pre-tokenization is effective in tasks involving morpheme-level tagging, detection and generation. In addition to accelerating Korean NLP research, our comprehensive documentation on creating KLUE will facilitate creating similar resources for other languages in the future. KLUE is available at https://klue-benchmark.com.

Paper Structure

This paper contains 166 sections, 1 equation, 10 figures, 36 tables, 1 algorithm.

Figures (10)

  • Figure 1: Label distributions generated by RTT (top) and GSM (bottom) in AIRBNB.
  • Figure 2: Similarity score distribution of the train (top) and dev (bottom) sets. The dev-set scores are close to a uniform distribution over the range 0-5. Scores are rounded to the first decimal place.
  • Figure 3: An example of the BIO scheme for NER tagging. The sentence is translated as: "<CNBlue:PS> is the best♥!!!! So sad <the next week:DT> is their last weekT.T Nooooo!!" where 씨엔블루 (CNBlue) is a Korean rock band. 담주 (the next week) is tagged as DT here; although it is agglutinated with the functional word 가 (is) in this sentence, it is annotated separately using the character-level BIO scheme.
  • Figure 4: Annotation tool for crowdsourcing. Main features are translated into English and shown in red.
  • Figure 5: An example of dependency parsing for a sentence that translates to "Chul-Soo ate an apple."
  • ...and 5 more figures
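
The character-level BIO scheme mentioned in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' annotation code. The function name `char_bio_tags`, the `(start, end, label)` span format, and the toy fragment "담주가 마지막" are assumptions made for illustration. Only the entity 담주 with tag DT and the functional word 가 come from the caption. Tagging whitespace as `O` is also an assumption.

```python
from typing import List, Tuple


def char_bio_tags(text: str, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Assign one BIO tag per character (hypothetical helper).

    spans holds (start, end, label) triples with an exclusive end index;
    the spans are assumed not to overlap.
    """
    tags = ["O"] * len(text)
    for start, end, label in spans:
        if not 0 <= start < end <= len(text):
            raise ValueError(f"span out of range: {(start, end, label)}")
        tags[start] = f"B-{label}"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining characters of the entity
    return tags


# Toy fragment (an assumption, not the sentence from Figure 3):
# 담주 (next week) is the DT entity; the attached particle 가 stays outside it.
text = "담주가 마지막"
spans = [(0, 2, "DT")]
tags = char_bio_tags(text, spans)

for ch, tag in zip(text, tags):
    print(repr(ch), tag)
```

Tagging at the character level lets the entity boundary fall inside a space-delimited word, so 담주 can be labeled without including the particle 가 that is attached to it.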