BabyLMs for isiXhosa: Data-Efficient Language Modelling in a Low-Resource Context

Alexis Matzopoulos, Charl Hendriks, Hishaam Mahomed, Francois Meyer

TL;DR

The paper investigates data-efficient language modelling for low-resource languages by applying BabyLM architectures to isiXhosa. Two architectures, ELC-BERT and MLSM, are pretrained on a 13M-word isiXhosa corpus and evaluated on POS tagging, NER, and news topic classification (NTC) against a RoBERTa baseline and multilingual skylines. The BabyLMs show clear gains over the baseline on POS tagging and NER, with ELC-BERT achieving the largest improvement (+3.2 F1 on NER), and in some instances they outperform XLM-R. However, no BabyLM surpasses every skyline, and NTC performance shows no improvement, which highlights the critical role of high-quality pretraining data. The work shows that architectural innovations can yield data-efficient gains for low-resource languages, but stresses that better, developmentally plausible corpora are needed to realise their full potential. The paper also provides qualitative analyses of how ELC-BERT and MLSM encode isiXhosa, linking architectural signals to downstream task performance.

Abstract

The BabyLM challenge called on participants to develop sample-efficient language models. Submissions were pretrained on a fixed English corpus, limited to the number of words children are exposed to in development (<100M). The challenge produced new architectures for data-efficient language modelling, which outperformed models trained on trillions of words. This is promising for low-resource languages, where available corpora are limited to far fewer than 100M words. In this paper, we explore the potential of BabyLMs for low-resource languages, using the isiXhosa language as a case study. We pretrain two BabyLM architectures, ELC-BERT and MLSM, on an isiXhosa corpus. They outperform a vanilla pretrained model on POS tagging and NER, achieving notable gains (+3.2 F1) for the latter. In some instances, the BabyLMs even outperform XLM-R. Our findings show that data-efficient models are viable for low-resource languages, but highlight the continued importance of, and lack of, high-quality pretraining data. Finally, we visually analyse how BabyLM architectures encode isiXhosa.

Paper Structure (20 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Downstream task performance for model checkpoints at different stages of pretraining.
  • Figure 2: Layer contribution heatmaps of isiXhosa ELC-BERT at different stages of pretraining.
  • Figure 3: Top 10 semantic categories predicted by isiXhosa MLSM for named entities (sampled from MasakhaNER).
  • Figure 4: Layer contribution heatmaps of isiXhosa ELC-BERT at different stages of pretraining.
  • Figure 5: Top 10 semantic categories predicted by isiXhosa MLSM for target words (sampled from MasakhaPOS).