Unsupervised Cross-Lingual Part-of-Speech Tagging with Monolingual Corpora Only
Jianyu Zheng
TL;DR
This paper tackles the scarcity of POS-annotated data in low-resource languages by introducing a fully unsupervised cross-lingual POS tagging framework that relies solely on monolingual corpora. It uses unsupervised neural machine translation to generate pseudo-parallel data from high-resource to low-resource languages, followed by projection-based POS tagging and a novel multi-source calibration to improve tag accuracy. Across 28 language pairs, the approach achieves POS tagging performance on par with baselines that require parallel corpora and even surpasses them for several language pairs, with the multi-source projection providing additional gains. This work demonstrates that effective cross-lingual POS tagging can be achieved without parallel data, enabling scalable tagging for typologically diverse and resource-scarce languages.
Abstract
Because part-of-speech (POS) annotated data are scarce, existing studies on low-resource languages typically adopt unsupervised approaches to POS tagging. Among these, POS tag projection via word alignment transfers POS tags from a high-resource source language to a low-resource target language using parallel corpora, which makes it particularly suitable for low-resource settings. However, this approach relies heavily on parallel corpora, which are unavailable for many low-resource languages. To overcome this limitation, we propose a fully unsupervised cross-lingual POS tagging framework that relies solely on monolingual corpora by leveraging an unsupervised neural machine translation (UNMT) system. The UNMT system first translates sentences from a high-resource language into a low-resource one, thereby constructing pseudo-parallel sentence pairs. We then train a POS tagger for the target language following the standard projection procedure based on word alignments. Moreover, we propose a multi-source projection technique that calibrates the projected POS tags on the target side, which helps train a more effective POS tagger. We evaluate our framework on 28 language pairs, covering four source languages (English, German, Spanish, and French) and seven target languages (Afrikaans, Basque, Finnish, Indonesian, Lithuanian, Portuguese, and Turkish). Experimental results show that our method achieves performance comparable to a baseline cross-lingual POS tagger that uses parallel sentence pairs, and even exceeds it for certain target languages. In addition, the proposed multi-source projection technique yields an average improvement of 1.3% over previous methods.
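The projection step described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names `project_tags` and `combine_sources` are hypothetical, the word alignments are assumed to be given as (source index, target index) pairs, and the multi-source step is approximated here by a simple per-token majority vote, since the abstract does not specify how the calibration is actually performed. The sketch also assumes that several projections are available for the same target sentence.

```python
from collections import Counter


def project_tags(source_tags, alignment, target_len, default=None):
    """Copy source POS tags onto target tokens through word alignments.

    alignment is an iterable of (source_index, target_index) pairs.
    Target tokens with no aligned source token receive `default`.
    """
    votes = [Counter() for _ in range(target_len)]
    for src_idx, tgt_idx in alignment:
        votes[tgt_idx][source_tags[src_idx]] += 1
    # For each target token, keep the tag projected most often onto it.
    return [c.most_common(1)[0][0] if c else default for c in votes]


def combine_sources(projections):
    """Assumed stand-in for multi-source calibration: per-token majority vote.

    projections is a list of tag sequences for the same target sentence,
    one per source language; None marks an unaligned token.
    """
    combined = []
    for tags in zip(*projections):
        counts = Counter(t for t in tags if t is not None)
        combined.append(counts.most_common(1)[0][0] if counts else None)
    return combined


# Toy example: a 3-token source sentence aligned to a 4-token target sentence.
en_projection = project_tags(
    ["DET", "NOUN", "VERB"], [(0, 0), (1, 1), (2, 2)], target_len=4
)

# Projections from two further (made-up) source languages for the same target.
de_projection = ["DET", "NOUN", "NOUN", "ADV"]
es_projection = ["DET", "ADJ", "VERB", None]

calibrated = combine_sources([en_projection, de_projection, es_projection])
print(calibrated)
```

In this toy case the vote resolves the disagreements on the second and third tokens and fills the fourth token, which the first projection left unaligned, from the one source that tagged it. The resulting tag sequences would then serve as training data for the target-language tagger.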
