HiligayNER: A Baseline Named Entity Recognition Model for Hiligaynon

James Ald Teves, Ray Daniel Cal, Josh Magdiel Villaluz, Jean Malolos, Mico Magtira, Ramon Rodriguez, Mideth Abisado, Joseph Marvin Imperial

TL;DR

HiligayNER tackles the lack of public NER resources for Hiligaynon by introducing the first openly available NER corpus and baselines. It compiles 8,082 cleaned sentences from diverse sources and fine-tunes two multilingual transformers, mBERT and XLM-RoBERTa, achieving macro F1 scores around 0.86–0.88 and strong Person-name recognition. The study also evaluates zero-shot cross-lingual transfer to Cebuano and Tagalog, with F1 around 0.46, indicating useful transferability for related Central Philippine languages. By releasing data, model checkpoints, and scripts, the work provides a reproducible foundation for Hiligaynon NLP and paves the way for broader low-resource language processing in the region.

Abstract

The language of Hiligaynon, spoken predominantly by the people of Panay Island, Negros Occidental, and Soccsksargen in the Philippines, remains underrepresented in language processing research due to the absence of annotated corpora and baseline models. This study introduces HiligayNER, the first publicly available baseline model for the task of Named Entity Recognition (NER) in Hiligaynon. The dataset used to build HiligayNER contains over 8,000 annotated sentences collected from publicly available news articles, social media posts, and literary texts. Two Transformer-based models, mBERT and XLM-RoBERTa, were fine-tuned on this collected corpus to build versions of HiligayNER. Evaluation results show strong performance, with both models achieving over 80% in precision, recall, and F1-score across entity types. Furthermore, cross-lingual evaluation with Cebuano and Tagalog demonstrates promising transferability, suggesting the broader applicability of HiligayNER for multilingual NLP in low-resource settings. This work aims to contribute to language technology development for underrepresented Philippine languages, specifically for Hiligaynon, and support future research in regional language processing.
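
The per-entity-type precision, recall, and F1 figures mentioned above, along with a macro average, could be computed along the following lines. This is a minimal token-level sketch, not the authors' evaluation script: the function name `macro_f1` is hypothetical, the tags PER, LOC, and ORG are assumed for illustration, and OTH is treated as the non-entity tag to exclude from averaging. The paper may instead score at the entity-span level, which would give different numbers.

```python
from collections import Counter


def macro_f1(gold, pred, ignore=("OTH",)):
    """Token-level precision/recall/F1 per tag, plus their macro-averaged F1.

    gold, pred: equal-length sequences of tag strings.
    ignore: tags left out of scoring (assumed here to be the non-entity tag).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            if g not in ignore:
                tp[g] += 1
        else:
            if p not in ignore:
                fp[p] += 1  # predicted tag p where it does not belong
            if g not in ignore:
                fn[g] += 1  # missed a token whose true tag is g

    scores = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)

    # Macro average: every tag counts equally, regardless of frequency.
    macro = sum(f1 for _, _, f1 in scores.values()) / len(scores) if scores else 0.0
    return scores, macro


# Toy example with made-up tags, not data from the paper.
gold_tags = ["PER", "PER", "LOC", "OTH", "ORG"]
pred_tags = ["PER", "LOC", "LOC", "OTH", "OTH"]
per_tag, macro = macro_f1(gold_tags, pred_tags)
```

Because the macro average weights each entity type equally, a rare type that the model misses entirely pulls the score down as much as a frequent one would.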

Paper Structure

This paper contains 14 sections, 1 equation, 5 figures, 5 tables.

Figures (5)

  • Figure 1: The overall methodology of developing HiligayNER using annotated news articles, social media posts, and literary text datasets in Hiligaynon using Transformer architectures mBERT and XLM-RoBERTa.
  • Figure 2: Training loss, validation loss, and F1 score per training step for the finetuned mBERT model.
  • Figure 3: Training loss, validation loss, and F1 score per training step for the finetuned XLM-RoBERTa model.
  • Figure 4: Confusion matrix of the finetuned mBERT model using HiligayNER across NER categories, omitting the OTH tag for brevity.
  • Figure 5: Confusion matrix of the finetuned XLM-RoBERTa model using HiligayNER across NER categories, omitting the OTH tag for brevity.