Does Incomplete Syntax Influence Korean Language Model? Focusing on Word Order and Case Markers

Jong Myoung Kim, Young-Jun Lee, Yong-jin Han, Sangkeun Jung, Ho-Jin Choi

TL;DR

This paper investigates whether Korean language models capture the language's inherent syntactic flexibility, particularly case-marker-driven word-order variations, and whether incomplete-syntax data can augment model performance. It introduces the SIKO dataset, generating five syntactically incomplete variants (CM_del, Shuf_Sem.Presrv, Shuf_Sem.Non.Presrv, and mixes) from KLUE-TC, KLUE-NLI, and AI-Hub dialogues, with 2,000 test and 20,000 training instances. Through inference experiments and fine-tuning studies across PKO-T5, Ko-GPT-Trinity, and ChatGPT, the authors show that LMs reflect Korean's syntactic flexibility and that SIKO-based fine-tuning consistently improves handling of incomplete inputs, often outperforming standard data augmentation methods. The work demonstrates a cost-effective data augmentation technique that enhances robustness in Korean NLP and suggests potential applicability to other case-marker languages with flexible word order. Overall, SIKO provides a practical path to bolster LM performance on real-world Korean text featuring incomplete syntax.

Abstract

Syntactic elements, such as word order and case markers, are fundamental in natural language processing. Recent studies show that syntactic information boosts language model performance and offers clues for people to understand their learning mechanisms. Unlike languages with a fixed word order such as English, Korean allows for varied word sequences, despite its canonical structure, due to case markers that indicate the functions of sentence components. This study explores whether Korean language models can accurately capture this flexibility. We note that incomplete word orders and omitted case markers frequently appear in ordinary Korean communication. To investigate this further, we introduce the Syntactically Incomplete Korean (SIKO) dataset. Through SIKO, we assessed Korean language models' flexibility with incomplete syntax and confirmed the dataset's training value. Results indicate these models reflect Korean's inherent flexibility, accurately handling incomplete inputs. Moreover, fine-tuning with SIKO enhances the ability to handle common incomplete Korean syntactic forms. The dataset's simple construction process, coupled with significant performance enhancements, solidifies its standing as an effective data augmentation technique.
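
To make the two perturbations named above concrete (omitting case markers and altering word order), the following is a minimal illustrative sketch and not the authors' construction pipeline. The particle list `CASE_MARKERS`, the helper names, and the choice to keep the final predicate in place when shuffling are all assumptions made for illustration. A real pipeline would likely rely on a morphological analyzer, since plain suffix matching can wrongly strip syllables from nouns that merely end in a particle-like character.

```python
import random

# Hypothetical, deliberately small list of common Korean case-marker particles.
# This is an assumption for illustration, not the inventory used for SIKO.
CASE_MARKERS = ("에서", "에게", "이", "가", "을", "를", "의", "에")


def delete_case_markers(sentence, markers=CASE_MARKERS):
    """Strip at most one trailing case marker from each space-delimited word."""
    # Try longer particles first so "에서" is matched before "에".
    ordered = sorted(markers, key=len, reverse=True)
    stripped = []
    for word in sentence.split():
        for marker in ordered:
            # Require a non-empty stem so a bare particle is left untouched.
            if len(word) > len(marker) and word.endswith(marker):
                word = word[: -len(marker)]
                break
        stripped.append(word)
    return " ".join(stripped)


def shuffle_keep_final(sentence, seed=0):
    """Shuffle every word except the last one (assumed to be the predicate)."""
    words = sentence.split()
    head, last = words[:-1], words[-1:]
    random.Random(seed).shuffle(head)  # seeded for reproducibility
    return " ".join(head + last)


example = "철수가 밥을 먹었다"
print(delete_case_markers(example))  # 철수 밥 먹었다
print(shuffle_keep_final(example))
```

Suffix matching keeps the sketch dependency-free; it is only meant to show the shape of the transformation, not to reproduce the dataset.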

Paper Structure

This paper contains 58 sections, 3 figures, and 13 tables.

Figures (3)

  • Figure 1: An example of Korean syntax flexibility: a) Case markers enable word order variability, b) They can be omitted in canonical sequences.
  • Figure 2: ChatGPT prompt and version information
  • Figure 3: Example of ChatGPT prompt usage