Meta-Pretraining for Zero-Shot Cross-Lingual Named Entity Recognition in Low-Resource Philippine Languages

David Demitri Africa, Suchir Salhan, Yuval Weiss, Paula Buttery, Richard Diehl Martinez

TL;DR

This paper tackles zero-shot cross-lingual NER for low-resource Philippine languages by meta-pretraining small decoder LMs with a first-order MAML objective, aiming to produce fast-adapting representations without exposure to Tagalog or Cebuano. The authors apply a hybrid pretraining regime to Pico decoders at four sizes, attach an untrained CRF head, finetune on high-resource languages, and then evaluate zero-shot on Tagalog and Cebuano. They report consistent zero-shot micro-F1 gains (2–6 points with head-only tuning, 1–3 points with full tuning). The largest improvements are for single-token person entities, especially those marked by surface cues such as Tagalog case particles; gains are more pronounced for smaller models and diminish with scale. Qualitative analyses suggest that meta-pretraining sharpens lexical prototypes and increases reliance on surface cues. The authors also identify limitations with multi-token entities and capacity constraints, and point to broader language coverage and alternative meta-objectives as future work.

Abstract

Named-entity recognition (NER) in low-resource languages is usually tackled by finetuning very large multilingual LMs, an option that is often infeasible in memory- or latency-constrained settings. We ask whether small decoder LMs can be pretrained so that they adapt quickly and transfer zero-shot to languages unseen during pretraining. To this end we replace part of the autoregressive objective with first-order model-agnostic meta-learning (MAML). Tagalog and Cebuano are typologically similar yet structurally different in their actor/non-actor voice systems, and hence serve as a challenging test-bed. Across four model sizes (11 M–570 M) MAML lifts zero-shot micro-F1 by 2–6 pp under head-only tuning and 1–3 pp after full tuning, while cutting convergence time by up to 8%. Gains are largest for single-token person entities that co-occur with Tagalog case particles si/ni, highlighting the importance of surface anchors.
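The first-order MAML idea mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' hybrid pretraining code: it assumes a scalar model y = w * x, a mean-squared-error loss, and synthetic regression tasks, and the names `fomaml_step`, `make_task`, and `mse_grad` are hypothetical. It only shows the general shape of the update: adapt a copy of the weights on a support set, then use the query-set gradient at the adapted weights as the meta-gradient, ignoring second-order terms.

```python
import random


def mse_grad(w, batch):
    """Gradient of mean squared error for the scalar model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def fomaml_step(w, tasks, inner_lr=0.05, outer_lr=0.1, inner_steps=1):
    """One first-order MAML outer update over (support, query) task pairs."""
    meta_grad = 0.0
    for support, query in tasks:
        fast_w = w  # task-specific copy of the shared initialisation
        for _ in range(inner_steps):
            # Inner loop: plain SGD on the support set.
            fast_w -= inner_lr * mse_grad(fast_w, support)
        # First-order approximation: take the query gradient at the adapted
        # weights and ignore how fast_w itself depends on w.
        meta_grad += mse_grad(fast_w, query)
    return w - outer_lr * meta_grad / len(tasks)


def make_task(slope, rng, n=8):
    """Toy task (an assumption for illustration): regress y = slope * x."""
    def sample():
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        return [(x, slope * x) for x in xs]
    return sample(), sample()


rng = random.Random(0)
w = 0.0
for _ in range(200):
    tasks = [make_task(rng.choice([1.0, 3.0]), rng) for _ in range(4)]
    w = fomaml_step(w, tasks)
print(f"meta-learned initialisation: {w:.3f}")
```

In this toy setting the learned initialisation settles between the two task slopes, so a single inner step moves it close to either task. In the paper's setting the analogous quantity would be the decoder weights, with the meta-objective replacing only part of the autoregressive loss.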

Paper Structure

This paper contains 40 sections, 15 figures, 7 tables, 2 algorithms.

Figures (15)

  • Figure 1: Scale curve. Zero-shot Micro-F$_1$ on Cebuano & Tagalog versus parameter count. Bars compare Pico-MAML (blue) to vanilla pretraining (green); the overlaid line shows the relative gain of MAML (Delta F1, right axis). Meta-pretraining helps at every scale, but the relative lift shrinks from +38 % (11 M) to +6 % (570 M), revealing a capacity threshold below which the inner loop cannot extract reusable features.
  • Figure 2: Impact of finetuning regime. Head-only tuning (left) magnifies the meta-learning advantage up to +2.5 pp at 570 M, likely because the backbone must already encode entity cues. Full tuning (right) reduces but does not erase the gap, suggesting that MAML primarily accelerates convergence rather than acting as a regulariser.
  • Figure 3: Sensitivity to finetuning language. Grid of zero-shot F$_1$ curves after adapting on nine high-resource languages plus an All-languages mixture. Eight of nine languages show positive deltas; the largest relative gains occur for Slovak and Croatian, while Simplified Chinese is the lone outlier (–2 pp). This pattern indicates that the meta-objective encourages reliance on surface affixes and particles that generalise well across Indo-European sources yet still transfer to Austronesian targets.
  • Figure 4: Learning curves for the Slovak head-only setting. Top: train loss; bottom: eval micro-F$_1$. Faint green lines = all individual checkpoints; bold line = median; shaded band = 25–75 % IQR. Both metrics converge monotonically and remain tightly bunched, indicating a stable optimisation surface for the linear head.
  • Figure 5: Final metrics vs. pretraining checkpoint for the medium MAML backbone frozen during head-only finetuning on Slovak. Top: final train loss of the CRF head; every run converges to the same narrow range. Bottom: final micro-F$_1$ on Slovak dev (blue), Tagalog (green) and Cebuano (yellow). Although in-language performance saturates early, cross-lingual F$_1$ keeps improving up to step 6000, indicating that later meta-updates learn representations useful specifically for zero-shot transfer.
  • ...and 10 more figures
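
The captions above mix absolute gains in percentage points (pp) with relative lifts in percent, and the two are easy to confuse. The sketch below shows how each could be computed from entity-level counts. The counts are hypothetical and are not taken from the paper; `micro_f1` and `relative_gain` are illustrative helper names, and the paper's own evaluation script may differ.

```python
def micro_f1(counts):
    """Micro-averaged F1 from per-type (tp, fp, fn) entity counts."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    denom = 2 * tp + fp + fn
    # F1 = 2PR / (P + R) simplifies to 2*tp / (2*tp + fp + fn).
    return 2 * tp / denom if denom else 0.0


def relative_gain(baseline, improved):
    """Relative change of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


# Hypothetical (tp, fp, fn) counts per entity type, NOT from the paper.
vanilla = {"PER": (40, 30, 60), "LOC": (20, 25, 55), "ORG": (10, 20, 65)}
maml = {"PER": (55, 28, 45), "LOC": (22, 24, 53), "ORG": (11, 20, 64)}

f1_vanilla = micro_f1(vanilla)
f1_maml = micro_f1(maml)
print(
    f"vanilla={f1_vanilla:.3f} maml={f1_maml:.3f} "
    f"absolute={100 * (f1_maml - f1_vanilla):.1f} pp "
    f"relative={relative_gain(f1_vanilla, f1_maml):.1f}%"
)
```

Because the relative lift divides by the baseline score, the same absolute gain in pp yields a larger relative figure when the baseline is low, which is worth keeping in mind when reading the relative lifts reported for the smallest models.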