Surfing the modeling of PoS taggers in low-resource scenarios

Manuel Vilares Ferro, Víctor M. Darriba Bilbao, Francisco J. Ribadas-Pena, Jorge Graña Gil

TL;DR

This work tackles model selection for PoS tagging in low-resource NLP by leveraging early estimation of learning curves to guide the choice of non-deep learners. It formalizes the approach using a learning framework with a kernel-based learning scheme ${\mathcal{D}}^{\mathcal{K}}_{\sigma}$, an accuracy pattern ${\pi}$ (power-law form), and learning traces ${\mathcal{A}}_{\ell}^{\pi}[{\mathcal{D}}^{\mathcal{K}}_{\sigma}]$, and it assesses reliability and robustness via a standardized testing frame. The case study on Galician XiADA demonstrates that the method yields reliable predictions of model performance across a diverse set of taggers, with quantitative MAPE values generally low and qualitative DMR results close to perfect, indicating stable decision-making. The findings support the practicality of early learning-curve-based model selection in low-resource settings and suggest broad applicability to additional languages and NLP tasks, potentially reducing annotation and computation costs without sacrificing performance.
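The idea of extrapolating a learning curve from its early portion can be illustrated with a minimal sketch. It assumes a three-parameter power law, acc(x) = c - a * x^(-b), which is one common power-law form and not necessarily the exact accuracy pattern ${\pi}$ used in the paper. The function names, the grid-search fitting procedure, and the synthetic data are all hypothetical and chosen only to keep the example self-contained. MAPE is computed with its standard definition, the mean absolute percentage error.

```python
# Sketch only: the power-law form, the fitting procedure and all names
# below are illustrative assumptions, not the authors' implementation.

def fit_power_law(sizes, accs, b_grid=None):
    """Fit acc(x) = c - a * x**(-b).

    b is chosen by grid search; for each fixed b the model is linear in
    z = x**(-b), so (a, c) follow from ordinary least squares.
    """
    if b_grid is None:
        b_grid = [i / 100 for i in range(1, 201)]  # b in (0, 2]
    n = len(sizes)
    best = None
    for b in b_grid:
        z = [x ** (-b) for x in sizes]
        mean_z = sum(z) / n
        mean_y = sum(accs) / n
        var_z = sum((zi - mean_z) ** 2 for zi in z)
        if var_z == 0:
            continue
        slope = sum((zi - mean_z) * (yi - mean_y)
                    for zi, yi in zip(z, accs)) / var_z
        c = mean_y - slope * mean_z
        a = -slope
        sse = sum((c - a * zi - yi) ** 2 for zi, yi in zip(z, accs))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c


def predict(params, size):
    """Forecast accuracy at a given training-set size."""
    a, b, c = params
    return c - a * size ** (-b)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((y - p) / y)
                       for y, p in zip(actual, predicted)) / len(actual)


# Synthetic, noise-free learning curve (hypothetical numbers, not results
# from the paper): accuracy in percent against training-set size in words.
TRUE_A, TRUE_B, TRUE_C = 80.0, 0.5, 97.0
sizes = [5000 * k for k in range(1, 17)]
accs = [TRUE_C - TRUE_A * s ** (-TRUE_B) for s in sizes]

# Fit on the first half of the curve only, then forecast the unseen half.
early = 8
params = fit_power_law(sizes[:early], accs[:early])
forecast = [predict(params, s) for s in sizes[early:]]
error = mape(accs[early:], forecast)
print(f"fitted (a, b, c) = {params}, MAPE on unseen sizes = {error:.6f}%")
```

With real taggers, one would fit one such curve per candidate on the same early prefix and prefer the candidate with the highest forecast accuracy at the target corpus size. Noisy observations would likely call for a more robust fitting procedure than this grid search.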

Abstract

The recent trend towards the application of deep structured techniques has revealed the limits of huge models in natural language processing. This has reawakened interest in traditional machine learning algorithms, which have proved to be still competitive in certain contexts, in particular low-resource settings. In parallel, model selection has become an essential task to boost performance at reasonable cost, even more so in domains where training and/or computational resources are scarce. Against this backdrop, we evaluate the early estimation of learning curves as a practical mechanism for selecting the most appropriate model in scenarios characterized by the use of non-deep learners in resource-lean settings. On the basis of a formal approximation model previously evaluated under conditions of wide availability of training and validation resources, we study the reliability of such an approach in a different and much more demanding operational environment. Using as a case study the generation of PoS taggers for Galician, a language belonging to the Western Ibero-Romance group, the experimental results are consistent with our expectations.

Paper Structure (23 sections, 10 equations, 5 figures, 1 table)

Figures (5)

  • Figure S1: Learning curve for SVMTool on XiADA, and an accuracy pattern fitting it.
  • Figure S2: Learning trace for SVMTool on XiADA, with details in zoom.
  • Figure S3: Working and prediction levels for SVMTool on XiADA, with details in zoom.
  • Figure S4: MAPEs, RRs and DMRs for runs.
  • Figure S5: Learning trends for the best and worst MAPEs.

Theorems & Definitions (7)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Definition 6
  • Definition 7