Surfing the modeling of PoS taggers in low-resource scenarios

Manuel Vilares Ferro; Víctor M. Darriba Bilbao; Francisco J. Ribadas-Pena; Jorge Graña Gil

TL;DR

This work tackles model selection for PoS tagging in low-resource NLP by leveraging early estimation of learning curves to guide the choice of non-deep learners. It formalizes the approach using a learning framework with a kernel-based learning scheme ${\mathcal{D}}^{\mathcal{K}}_{\sigma}$, an accuracy pattern ${\pi}$ (power-law form), and learning traces ${\mathcal{A}}_{\ell}^{\pi}[{\mathcal{D}}^{\mathcal{K}}_{\sigma}]$, and it assesses reliability and robustness via a standardized testing frame. The case study on Galician XiADA demonstrates that the method yields reliable predictions of model performance across a diverse set of taggers, with quantitative MAPE values generally low and qualitative DMR results close to perfect, indicating stable decision-making. The findings support the practicality of early learning-curve-based model selection in low-resource settings and suggest broad applicability to additional languages and NLP tasks, potentially reducing annotation and computation costs without sacrificing performance.
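To make the idea of a power-law accuracy pattern and a MAPE check concrete, here is a minimal sketch rather than the authors' procedure. It assumes the form acc(x) = c - a * x^(-b), fits it to the early part of a learning curve with a simple grid search over the asymptote c plus a log-space linear regression, and then measures the MAPE of the extrapolation. The function names, the fitting strategy, and the synthetic data are all illustrative assumptions.

```python
import math

def fit_power_law(sizes, accs, steps=2000):
    """Fit acc(x) ~ c - a * x**(-b) (an assumed power-law form).

    Grid-searches the asymptote c in (max(accs), 100] and, for each
    candidate, solves log(c - acc) = log(a) - b * log(x) by least squares.
    Assumes accuracies are percentages strictly below 100.
    """
    top = max(accs)
    if top >= 100.0:
        raise ValueError("accuracies must be strictly below 100")
    xs = [math.log(x) for x in sizes]
    n = len(xs)
    mx = sum(xs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    best = None
    for i in range(1, steps + 1):
        c = top + (100.0 - top) * i / steps  # candidate asymptote
        ys = [math.log(c - y) for y in accs]
        my = sum(ys) / n
        slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
        a, b = math.exp(my - slope * mx), -slope
        # Score the candidate by squared error in the original accuracy space.
        err = sum((c - a * x ** (-b) - y) ** 2 for x, y in zip(sizes, accs))
        if best is None or err < best[0]:
            best = (err, a, b, c)
    return best[1], best[2], best[3]

def predict(params, x):
    """Evaluate the fitted pattern at training-set size x."""
    a, b, c = params
    return c - a * x ** (-b)

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(p - t) / abs(t) for t, p in zip(actual, predicted)) / len(actual)

# Synthetic learning curve (hypothetical numbers, not from the paper).
def true_curve(x):
    return 95.0 - 50.0 * x ** (-0.5)

early_sizes = [1000 * k for k in range(1, 11)]    # observed prefix of the curve
late_sizes = [20000 * k for k in range(1, 6)]     # sizes we want to anticipate
params = fit_power_law(early_sizes, [true_curve(x) for x in early_sizes])
error = mape([true_curve(x) for x in late_sizes],
             [predict(params, x) for x in late_sizes])
print("fitted (a, b, c):", params)
print("MAPE on extrapolated sizes (%):", error)
```

Running the same fit for several candidate taggers and comparing their extrapolated accuracies is one way to rank them before the full training corpus is available, which is the decision that the qualitative perspective on reliability is meant to assess.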

Abstract

The recent trend towards the application of deep structured techniques has revealed the limits of huge models in natural language processing. This has reawakened interest in traditional machine learning algorithms, which have proved to be still competitive in certain contexts, particularly in low-resource settings. In parallel, model selection has become an essential task to boost performance at reasonable cost, even more so in domains where training and/or computational resources are scarce. Against this backdrop, we evaluate the early estimation of learning curves as a practical mechanism for selecting the most appropriate model in scenarios characterized by the use of non-deep learners in resource-lean settings. On the basis of a formal approximation model previously evaluated under conditions of wide availability of training and validation resources, we study the reliability of such an approach in a different and much more demanding operational environment. Taking as a case study the generation of PoS taggers for Galician, a language belonging to the Western Ibero-Romance group, the experimental results are consistent with our expectations.

Paper Structure (23 sections, 10 equations, 5 figures, 1 table)

Introduction
Related Work and Contribution
The Formal Framework
The notational support
Correctness
Robustness
The Testing Frame
The monitoring structure
The performance metrics
Measuring the reliability
The quantitative perspective
The qualitative perspective
Measuring the robustness
The Experiments
The linguistics resources
...and 8 more sections

Figures (5)

Figure S1: Learning curve for SVMTool on XiADA, and an accuracy pattern fitting it.
Figure S2: Learning trace for SVMTool on XiADA, with details in zoom.
Figure S3: Working and prediction levels for SVMTool on XiADA, with details in zoom.
Figure S4: MAPEs, RRs and DMRs for runs.
Figure S5: Learning trends for the best and worst MAPEs.

Theorems & Definitions (7)

Definition 1
Definition 2
Definition 3
Definition 4
Definition 5
Definition 6
Definition 7
