Second language Korean Universal Dependency treebank v1.2: Focus on data augmentation and annotation scheme refinement

Hakyung Sung, Gyu-Ho Shin

TL;DR

This work addresses the reliability of morphosyntactic analysis for learner Korean by expanding the L2-Korean UD treebank to v1.2 with 12,984 sentences and by refining the annotation guidelines to align with UD while capturing Korean-specific features. The revised scheme adopts left-to-right coordination, left-headed flat structures, and a constrained set of auxiliary verbs; annotations were produced by multiple native speakers, with reliability quantified. The authors fine-tune three parsers (Stanza, spaCy, and Trankit) and compare them with a baseline Stanza model on in-domain data and on out-of-domain KoLLA data, finding that Trankit consistently achieves the best XPOS, UAS, and LAS scores, while Stanza performs best on LEMMA. Overall, the results demonstrate the importance of high-quality, domain-diverse L2 data when fine-tuning first-language models for morphosyntactic analysis, with practical implications for UD-based learner corpora and future cross-domain generalization.

Abstract

We expand the second language (L2) Korean Universal Dependencies (UD) treebank with 5,454 manually annotated sentences. The annotation guidelines are also revised to better align with the UD framework. Using this enhanced treebank, we fine-tune three Korean language models and evaluate their performance on in-domain and out-of-domain L2-Korean datasets. The results show that fine-tuning significantly improves their performance across various metrics, thus highlighting the importance of using well-tailored L2 datasets for fine-tuning first-language-based, general-purpose language models for the morphosyntactic analysis of L2 data.
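
For readers unfamiliar with the parsing metrics named in the TL;DR, the following is a minimal sketch of how UAS (unlabeled attachment score) and LAS (labeled attachment score) are conventionally computed for one sentence. It is not the authors' evaluation code: the function name `attachment_scores`, the `Arc` type, and the toy gold and predicted arcs are all hypothetical, chosen only for illustration.

```python
from typing import List, Tuple

# One token's analysis: (index of its head, dependency relation label).
# Head index 0 conventionally denotes the artificial root.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) for token-aligned gold and predicted arcs."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    # UAS counts tokens whose predicted head matches the gold head.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS additionally requires the relation label to match.
    las_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return uas_hits / n, las_hits / n


# Toy four-token example (hypothetical, not taken from the treebank).
gold = [(2, "nsubj"), (0, "root"), (2, "conj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "punct")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

In this toy case three of four heads are correct and two of four head-label pairs are correct, so LAS can never exceed UAS. Corpus-level scores are usually computed over all tokens rather than by averaging per-sentence values.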

Paper Structure

This paper contains 15 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Coordination (Left-headed) 'I looked around and ate some foods.'
  • Figure 2: Flat (Left-headed) 'Youngsoo is good at tennis.'