The UD-NewsCrawl Treebank: Reflections and Challenges from a Large-scale Tagalog Syntactic Annotation Project

Angelina A. Aquino, Lester James V. Miranda, Elsie Marie T. Or

TL;DR

UD-NewsCrawl introduces the largest Tagalog UD treebank to date (15,619 sentences) and documents a rigorous, multi-stage annotation pipeline with quality-control protocols to produce UD-compliant syntax, morphology, and dependency labels. The work provides baseline transformer-based parsers across multiple representations, demonstrating that multilingual context-sensitive models like XLM-RoBERTa yield strong performance on Tagalog while enabling cross-treebank evaluation. Analyses include quality assessment, cross-treebank generalization, and topic classification, revealing domain biases and the limits of cross-lingual transfer for Tagalog. The dataset and baselines offer a valuable resource for Tagalog NLP, guiding annotation practices for underrepresented languages and informing future efforts to broaden linguistic coverage and UD guideline adaptation.

Abstract

This paper presents UD-NewsCrawl, the largest Tagalog treebank to date, containing 15.6k trees manually annotated according to the Universal Dependencies framework. We detail our treebank development process, including data collection, pre-processing, manual annotation, and quality assurance procedures. We provide baseline evaluations using multiple transformer-based models to assess the performance of state-of-the-art dependency parsers on Tagalog. We also highlight challenges in the syntactic analysis of Tagalog given its distinctive grammatical properties, and discuss their implications for the annotation of this treebank. We anticipate that UD-NewsCrawl and our baseline model implementations will serve as valuable resources for advancing computational linguistics research in underrepresented languages like Tagalog.

Paper Structure

This paper contains 66 sections, 1 equation, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Example sentences illustrating features of Tagalog voice marking under a symmetrical voice analysis: each sentence has a different subject (preceded by the ang marker) and a different verbal affix denoting the thematic role (av = agent, pv = patient, lv = locative) of the subject, but all three examples are pragmatically equivalent to the English sentence "The man gave flowers to the woman."
  • Figure 2: Annotation workflow for UD-NewsCrawl.
  • Figure 3: Topic distribution of UD-NewsCrawl using categories from SIB-200 (Adelani et al., 2024).
  • Figure 4: Embedding map generated using the Nomic Atlas API for fine-grained topic classification.
  • Figure 5: Prompt template for topic classification.
  • ...and 1 more figure