Korean Named Entity Recognition Based on Language-Specific Features

Yige Chen, KyungTae Lim, Jungyeul Park

TL;DR

The paper addresses the challenge of Korean NER by exploiting language-specific morphology and proposes a morpheme-based CoNLL-U representation with an automatic conversion algorithm from word-based corpora. It develops CRF, RNN, and BERT/XLM-RoBERTa–based NER models that jointly utilize POS features (UPOS and XPOS) and demonstrates consistent improvements over eojeol- and syllable-based annotations across multiple datasets. The contributions include the conversion tooling, a unified POS-NER representation, and comprehensive comparisons across annotation schemes and transformer models, with XLM-RoBERTa often achieving the best performance. This work has practical implications for Korean NLP applications such as information extraction, search, and machine translation by enabling more accurate named entity recognition in Korean text.

Abstract

In this paper, we propose a novel way of improving named entity recognition in the Korean language using its language-specific features. While the field of named entity recognition has been studied extensively in recent years, the mechanism of efficiently recognizing named entities in Korean has hardly been explored. This is because the Korean language has distinct linguistic properties that prevent models from achieving their best performance. We therefore propose an annotation scheme for Korean corpora that adopts the CoNLL-U format. The scheme decomposes Korean words into morphemes and reduces the ambiguity of named entities in the original segmentation, which may contain functional morphemes such as postpositions and particles. We investigate how the named entity tags are best represented in this morpheme-based scheme and implement an algorithm to convert word-based and syllable-based Korean corpora with named entities into the proposed morpheme-based format. Analyses of the results of statistical and neural models show that the proposed morpheme-based format is feasible, and we demonstrate how the models' performance varies under the influence of various additional language-specific features. Extrinsic conditions were also considered to observe the variance in the performance of the proposed models given different types of data, including the original segmentation and different types of tagging formats.
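
The morpheme-based tagging idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' conversion algorithm: the function name `eojeol_to_morpheme_tags` is hypothetical, the input is assumed to be already morphologically analyzed, and functional morphemes are assumed to be those whose Sejong-style POS tag starts with `J` (postpositions) or `E` (endings). The sketch only shows how an eojeol-level BIO tag could be pushed down to content morphemes while functional morphemes receive `O`.

```python
# Assumption: Sejong-style XPOS tags; J* = postpositions/particles, E* = endings.
FUNCTIONAL_PREFIXES = ("J", "E")

def eojeol_to_morpheme_tags(morphemes, eojeol_tag):
    """Project one eojeol-level BIO tag onto its morphemes.

    morphemes:  list of (form, xpos) pairs for a single eojeol
    eojeol_tag: BIO tag of the whole eojeol, e.g. "B-LOC", "I-PER", or "O"
    Returns a list of (form, xpos, tag) triples.
    """
    if eojeol_tag == "O":
        return [(form, xpos, "O") for form, xpos in morphemes]

    prefix, ne_type = eojeol_tag.split("-", 1)
    tagged = []
    first_content = True
    for form, xpos in morphemes:
        if xpos.startswith(FUNCTIONAL_PREFIXES):
            # Functional morphemes are excluded from the named entity.
            tagged.append((form, xpos, "O"))
        else:
            # The first content morpheme keeps the eojeol's prefix;
            # later content morphemes continue the same entity.
            tag = f"{prefix}-{ne_type}" if first_content else f"I-{ne_type}"
            first_content = False
            tagged.append((form, xpos, tag))
    return tagged

# "프랑스의" (France + genitive particle), tagged B-LOC at the eojeol level
print(eojeol_to_morpheme_tags([("프랑스", "NNP"), ("의", "JKG")], "B-LOC"))
```

In this simplified view the genitive particle 의 falls outside the entity, so only 프랑스 carries the location tag. Cases such as functional morphemes inside a multi-word entity are not handled here.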

Paper Structure

This paper contains 31 sections, 5 equations, 7 figures, 9 tables, 1 algorithm.

Figures (7)

  • Figure 1: Distribution of each type of postposition/particle after NEs (NER data from NAVER). The terms and notations in the figure are described in the paper's table labeled ne-xpos-captions-table.
  • Figure 2: Various approaches to annotating named entities (NEs): the eojeol-based approach annotates the entire word, the morpheme-based approach annotates only the entity morpheme and excludes the functional morphemes, and the syllable-based approach annotates syllable by syllable to exclude the functional morphemes.
  • Figure 3: CoNLL-U style annotation with multiword tokens for morphological analysis and POS tagging. It can include BIO-based NER annotation, where B-LOC marks the beginning word of a location and I-PER marks an inside word of a person.
  • Figure 4: Overall structure of our RNN-based model.
  • Figure 5: CRF feature template example for the word and named entity data set.
  • ...and 2 more figures