Small Models, Big Impact: Efficient Corpus and Graph-Based Adaptation of Small Multilingual Language Models for Low-Resource Languages

Daniil Gurgurov, Ivan Vykopal, Josef van Genabith, Simon Ostermann

TL;DR

The paper addresses natural language processing for 30 low-resource languages (LRLs) by exploring parameter-efficient, adapter-based adaptation of small multilingual language models (mBERT, XLM-R) using unstructured text (GlotCC) and structured knowledge (ConceptNet). It evaluates three adapter architectures (Sequential Bottleneck, Invertible Bottleneck, and LoRA) on masked language modeling (MLM) and three downstream tasks (topic classification, sentiment analysis, and named entity recognition), showing that modest amounts of adaptation data (up to 1 GB of text or a few MB of knowledge graph data) yield meaningful gains. Smaller mLMs with adapters can outperform prompting of large LLMs in many LRL scenarios, though pre-training coverage remains a dominant factor. The work shows that ConceptNet can boost named entity recognition (NER) while GlotCC provides broad improvements. It also finds a moderate link between MLM quality and downstream task performance, with adaptation data offering diminishing returns for languages that already have extensive pre-training coverage.

Abstract

Low-resource languages (LRLs) face significant challenges in natural language processing (NLP) due to limited data. While current state-of-the-art large language models (LLMs) still struggle with LRLs, smaller multilingual models (mLMs) such as mBERT and XLM-R offer greater promise due to a better fit of their capacity to low training data sizes. This study systematically investigates parameter-efficient adapter-based methods for adapting mLMs to LRLs, evaluating three architectures: Sequential Bottleneck, Invertible Bottleneck, and Low-Rank Adaptation. Using unstructured text from GlotCC and structured knowledge from ConceptNet, we show that small adaptation datasets (e.g., up to 1 GB of free-text or a few MB of knowledge graph data) yield gains in intrinsic (masked language modeling) and extrinsic tasks (topic classification, sentiment analysis, and named entity recognition). We find that Sequential Bottleneck adapters excel in language modeling, while Invertible Bottleneck adapters slightly outperform other methods on downstream tasks due to better embedding alignment and larger parameter counts. Adapter-based methods match or outperform full fine-tuning while using far fewer parameters, and smaller mLMs prove more effective for LRLs than massive LLMs like LLaMA-3, GPT-4, and DeepSeek-R1-based distilled models. While adaptation improves performance, pre-training data size remains the dominant factor, especially for languages with extensive pre-training coverage.
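The claim that adapters use "far fewer parameters" than full fine-tuning can be made concrete with a rough parameter count. The sketch below is illustrative only and is not taken from the paper: the hidden size of 768 (the base-size configuration of these models), the bottleneck size of 48, and the LoRA rank of 8 are assumed values, and the function names are hypothetical. It compares one bottleneck adapter, one LoRA update, and one fully trained dense layer of the same width.

```python
# Illustrative parameter counts. All sizes below are assumptions,
# not values reported in the paper.

def bottleneck_adapter_params(hidden_size: int, bottleneck_size: int) -> int:
    """Down-projection and up-projection, each with a bias vector."""
    down = hidden_size * bottleneck_size + bottleneck_size
    up = bottleneck_size * hidden_size + hidden_size
    return down + up


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def dense_layer_params(d_in: int, d_out: int) -> int:
    """A fully trainable linear layer: weight matrix plus bias."""
    return d_in * d_out + d_out


HIDDEN = 768      # assumed hidden size (base-size model)
BOTTLENECK = 48   # assumed bottleneck size (hypothetical)
RANK = 8          # assumed LoRA rank (hypothetical)

if __name__ == "__main__":
    full = dense_layer_params(HIDDEN, HIDDEN)
    adapter = bottleneck_adapter_params(HIDDEN, BOTTLENECK)
    lora = lora_params(HIDDEN, HIDDEN, RANK)
    print(f"dense layer:        {full:>7} parameters")
    print(f"bottleneck adapter: {adapter:>7} parameters ({adapter / full:.1%})")
    print(f"LoRA update:        {lora:>7} parameters ({lora / full:.1%})")
```

Under these assumed sizes, both adapter variants train only a small fraction of the weights of a single dense layer. Invertible Bottleneck adapters add further parameters on top of the bottleneck count, which is consistent with the abstract's remark about their larger parameter counts; that extra cost is not modeled here.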

Paper Structure

This paper contains 49 sections, 6 figures, 27 tables.

Figures (6)

  • Figure 1: Correlation between the pre-training data sizes for mBERT and XLM-R and downstream task results for the pre-adaptation and post-adaptation results. The vertical bars indicate the amounts of adaptation data. The improvements in downstream performance for both models are primarily concentrated in languages with smaller pre-training data sizes, which are positioned on the left side of the plots. Conversely, for languages with substantial representation in the pre-training data, the improvements are less pronounced or nonexistent (Section \ref{sec:task_data}).
  • Figure 2: Correlation between the pre-training data sizes for mBERT and XLM-R and the pseudo-perplexities with the values fit in the log-space for the pre-adaptation and post-adaptation results.
  • Figure 3: Correlation between the downstream performance for mBERT and XLM-R pre- and post-adaptation and the pseudo-perplexities.
  • Figure 4: Correlation between the downstream performance for mBERT and XLM-R and the pre-training data and adaptation data.
  • Figure 5: Correlation between the downstream performance for mBERT and XLM-R and the pre-training data and adaptation data.
  • ...and 1 more figure