Data Augmentation and Hyperparameter Tuning for Low-Resource MFA

Alessio Tosolini, Claire Bowern

TL;DR

Forced-alignment accuracy degrades for low-resource languages because training data is limited. The paper investigates data augmentation and hyperparameter tuning for the Montreal Forced Aligner (MFA) using about 10 hours of Australian Indigenous language data, including the Yidiny and Big5 corpora. It finds that audio augmentation provides limited gains, while systematic hyperparameter tuning (especially increasing the number of monophone iterations and selecting an effective 22-class triphone grouping) yields substantial improvements, enabling strong alignment with as little as about 30 minutes of data. The results suggest that carefully configured monolingual models can match or exceed the performance of cross-language-adapted models, offering a practical route for researchers documenting endangered languages.

Abstract

A continued issue for those working with computational tools and endangered and under-resourced languages is the lower accuracy of results for languages with smaller amounts of data. We attempt to ameliorate this issue by using data augmentation methods to increase corpus size, comparing augmentation to hyperparameter tuning for multilingual forced alignment. Unlike text augmentation methods, audio augmentation does not lead to substantially increased performance. Hyperparameter tuning, on the other hand, results in substantial improvement without (for this amount of data) infeasible additional training time. For languages with small to medium amounts of training data, this is a workable alternative to adapting models from high-resource languages.

Paper Structure

This paper contains 9 sections, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Mean differences for AugBig5 and Big5 datasets for different triphone groupings.
  • Figure 2: Mean differences for AugYidiny and Yidiny datasets for different triphone groupings.
  • Figure 3: Mean differences for different numbers of monophone iterations.
  • Figure 4: Mean differences (Big5 models) across configurations.
  • Figure 5: Mean differences (Yidiny models) across configurations.