Zero Resource Cross-Lingual Part Of Speech Tagging
Sahil Chopra
TL;DR
This work tackles zero-resource POS tagging by transferring English annotations to French, German, and Spanish through word-alignment projection and training a Hidden Markov Model on the projected data. The pipeline uses OPUS-MT translations, SimAlign word alignment, and an easy projection scheme with Viterbi decoding under a universal 12-tag POS set, comparing projected-data training (GD) to gold-annotated training (AD). Results show that projected data yields tangible improvements over no-target-label baselines but remains below fully supervised performance, with notable deficits on pronouns, proper nouns, and conjunctions due to alignment and data balance issues. The study demonstrates the viability of projection-based zero-resource POS tagging as a practical baseline for low-resource languages and highlights concrete bottlenecks related to alignment quality and token-level divergences.
Abstract
Part-of-speech (POS) tagging in zero-resource settings can be an effective approach for low-resource languages when no labeled training data is available. Existing systems use two main techniques for POS tagging: applying pretrained multilingual large language models (LLMs), or projecting the source-language labels onto the zero-resource target language and training a sequence labeling model on the projected data. We explore the latter approach, using an off-the-shelf alignment module and training a hidden Markov model (HMM) to predict POS tags. We evaluate a transfer learning setup with English as the source language and French, German, and Spanish as target languages. We conclude that projected alignment data in a zero-resource language can be beneficial for predicting POS tags.
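
To make the projection-then-HMM idea concrete, the following is a minimal sketch and not the implementation used in this work. It assumes word alignments are already available as (source index, target index) pairs, for example from an aligner such as SimAlign. The function names (`project_tags`, `train_hmm`, `viterbi`), the add-one smoothing, the use of the tag "X" for unaligned target tokens, and the toy sentences are all illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def project_tags(src_tags, alignment, tgt_len, default="X"):
    """Copy each source tag to its aligned target token.

    Unaligned target tokens keep `default` (an assumption of this sketch).
    If several source tokens align to one target token, the last one wins.
    """
    tgt_tags = [default] * tgt_len
    for s, t in alignment:
        tgt_tags[t] = src_tags[s]
    return tgt_tags


def train_hmm(tagged_sents, alpha=1.0):
    """Estimate smoothed transition and emission log-probabilities from counts."""
    trans, emit, vocab = defaultdict(Counter), defaultdict(Counter), set()
    for sent in tagged_sents:
        prev = "<s>"  # sentence-start pseudo-tag
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    tags = sorted(emit)

    def log_trans(p, t):
        return math.log((trans[p][t] + alpha) / (sum(trans[p].values()) + alpha * len(tags)))

    def log_emit(t, w):
        # The +1 in the denominator reserves probability mass for unseen words.
        return math.log((emit[t][w] + alpha) / (sum(emit[t].values()) + alpha * (len(vocab) + 1)))

    return tags, log_trans, log_emit


def viterbi(words, tags, log_trans, log_emit):
    """Return the most likely tag sequence for a non-empty list of words."""
    scores = [{t: log_trans("<s>", t) + log_emit(t, words[0]) for t in tags}]
    back = []
    for w in words[1:]:
        prev, cur, ptr = scores[-1], {}, {}
        for t in tags:
            best = max(tags, key=lambda p: prev[p] + log_trans(p, t))
            cur[t] = prev[best] + log_trans(best, t) + log_emit(t, w)
            ptr[t] = best
        scores.append(cur)
        back.append(ptr)
    path = [max(tags, key=lambda t: scores[-1][t])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Toy example: English tags projected onto French through one-to-one alignments.
en_tags = ["DET", "NOUN", "VERB"]
align = [(0, 0), (1, 1), (2, 2)]
fr_sents = [["le", "chat", "dort"], ["le", "chien", "mange"]]
projected = [list(zip(s, project_tags(en_tags, align, len(s)))) for s in fr_sents]
tags, log_trans, log_emit = train_hmm(projected)
predicted = viterbi(["le", "chien", "dort"], tags, log_trans, log_emit)
```

In this toy setting the HMM never sees gold target-language tags; it is trained only on the tags projected from English, which mirrors the zero-resource condition described above.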
