
Unsupervised Cross-Lingual Part-of-Speech Tagging with Monolingual Corpora Only

Jianyu Zheng

TL;DR

This paper tackles the scarcity of POS-annotated data in low-resource languages by introducing a fully unsupervised cross-lingual POS tagging framework that relies solely on monolingual corpora. It uses unsupervised neural machine translation to generate pseudo-parallel data from high-resource to low-resource languages, followed by projection-based POS tagging and a novel multi-source calibration to improve tag accuracy. Across 28 language pairs, the approach achieves POS tagging performance on par with baselines that require parallel corpora and even surpasses them for several language pairs, with the multi-source projection providing additional gains. This work demonstrates that effective cross-lingual POS tagging can be achieved without parallel data, enabling scalable tagging for typologically diverse and resource-scarce languages.

Abstract

Due to the scarcity of part-of-speech (POS) annotated data, existing studies on low-resource languages typically adopt unsupervised approaches for POS tagging. Among these, POS tag projection with word alignment transfers POS tags from a high-resource source language to a low-resource target language based on parallel corpora, making it particularly suitable for low-resource settings. However, this approach relies heavily on parallel corpora, which are unavailable for many low-resource languages. To overcome this limitation, we propose a fully unsupervised cross-lingual POS tagging framework that relies solely on monolingual corpora by leveraging an unsupervised neural machine translation (UNMT) system. The UNMT system first translates sentences from a high-resource language into a low-resource one, thereby constructing pseudo-parallel sentence pairs. Then, we train a POS tagger for the target language following the standard projection procedure based on word alignments. Moreover, we propose a multi-source projection technique to calibrate the projected POS tags on the target side, which helps train a more effective POS tagger. We evaluate our framework on 28 language pairs, covering four source languages (English, German, Spanish and French) and seven target languages (Afrikaans, Basque, Finnish, Indonesian, Lithuanian, Portuguese and Turkish). Experimental results show that our method achieves performance comparable to the baseline cross-lingual POS tagger trained with parallel sentence pairs, and even exceeds it for certain target languages. Furthermore, our proposed multi-source projection technique boosts performance, yielding an average improvement of 1.3% over previous methods.
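
The projection step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that source-side POS tags and word alignments (pairs of source and target token indices) are already available for one pseudo-parallel sentence pair. The function name `project_tags`, the first-link-wins rule, and the toy data are illustrative assumptions, not the paper's implementation.

```python
from typing import List, Optional, Tuple


def project_tags(
    source_tags: List[str],
    target_len: int,
    alignment: List[Tuple[int, int]],
) -> List[Optional[str]]:
    """Copy each source token's tag onto the target token it is aligned to.

    Target tokens with no alignment link stay None (unlabeled).
    Assumption: when several source tokens align to one target token,
    the first link wins. The paper may resolve such conflicts differently.
    """
    projected: List[Optional[str]] = [None] * target_len
    for src_idx, tgt_idx in alignment:
        if projected[tgt_idx] is None:
            projected[tgt_idx] = source_tags[src_idx]
    return projected


# Toy example: three tagged source tokens, four target tokens.
source_tags = ["DET", "NOUN", "VERB"]
alignment = [(0, 0), (1, 1), (2, 2)]  # hypothetical aligner output
print(project_tags(source_tags, 4, alignment))
```

The projected tags, with unaligned tokens left unlabeled, would then serve as noisy training instances for the target-language tagger.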

Paper Structure (25 sections, 1 equation, 6 figures, 7 tables)

This paper contains 25 sections, 1 equation, 6 figures, 7 tables.

Figures (6)

  • Figure 1: The steps of commonly used unsupervised cross-lingual POS tagging methods: (a) zero-shot cross-lingual transfer using multilingual pre-trained language models; (b) POS tag projection through word alignment with parallel corpus.
  • Figure 2: The fully unsupervised cross-lingual part-of-speech (POS) tagging framework, which does not rely on parallel sentence pairs. This framework consists of four steps: 1) constructing pseudo-parallel sentence pairs; 2) generating the training instances for the target language; 3) calibrating the annotation results through the multi-source projection technique; 4) training a neural POS tagger for the target language.
  • Figure 3: The effect of the number of pseudo-parallel sentence pairs on the accuracy of the POS tagger for each target language. Seven language pairs are chosen for this experiment.
  • Figure 4: The accuracy of POS taggers for the seven target languages across each POS category. For each target language, the average accuracy of POS taggers trained on the four source languages is reported. The first row displays the accuracy for four content word categories (noun, verb, adjective and pronoun), while the second row shows the accuracy for four function word categories (preposition, auxiliary word, coordinating conjunction, and determiner).
  • Figure 5: Accuracy of POS taggers for seven target languages on multi-category words. (a), (b) and (c) display the accuracy results for "verb&noun", "verb&adjective" and "noun&adjective", respectively, while (d) shows the accuracy for all multi-category words.
  • ...and 1 more figures
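
Step 3 of the framework in Figure 2 calibrates projected tags using several source languages. The text here does not say how the calibration is computed, so the sketch below swaps in a simple per-token majority vote as a stand-in. It assumes that projections from several source languages are available for the same target sentence. The function name `calibrate_by_vote` and the choice to leave ties unlabeled are illustrative assumptions.

```python
from collections import Counter
from typing import List, Optional


def calibrate_by_vote(
    projections: List[List[Optional[str]]],
) -> List[Optional[str]]:
    """Combine per-source projected tags for one target sentence.

    Each inner list holds the tags projected from one source language
    (None means the token was unaligned). Assumption: a plain majority
    vote per token, with ties left unlabeled. This is a stand-in, not
    the paper's calibration rule.
    """
    calibrated: List[Optional[str]] = []
    for votes in zip(*projections):
        counts = Counter(tag for tag in votes if tag is not None)
        if not counts:
            calibrated.append(None)  # no source language labeled this token
            continue
        (top_tag, top_n), *rest = counts.most_common(2)
        if rest and rest[0][1] == top_n:
            calibrated.append(None)  # tie: leave unlabeled rather than guess
        else:
            calibrated.append(top_tag)
    return calibrated


# Toy example: three hypothetical source-language projections, three tokens.
projections = [
    ["NOUN", "VERB", None],
    ["NOUN", "NOUN", None],
    ["VERB", "VERB", None],
]
print(calibrate_by_vote(projections))
```

Leaving ties and unaligned tokens unlabeled trades coverage for precision. A tagger trained on such data would need to ignore unlabeled positions in its loss.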