Bootstrapping Embeddings for Low Resource Languages

Merve Basoz, Andrew Horne, Mattia Opper

TL;DR

While in-context learning still falls short of strong non-synthetic baselines, adapter composition and XL-LoRA yield strong performance gains across a wide array of tasks and languages, offering a clear, scalable pathway to producing performant embedding models for a wide variety of languages.

Abstract

Embedding models are crucial to modern NLP. However, the creation of the most effective models relies on carefully constructed supervised finetuning data. For high-resource languages, such as English, such datasets are readily available. For hundreds of other languages, however, they simply do not exist. We investigate whether the advent of large language models can help to bridge this gap. We test three different strategies for generating synthetic triplet data used to optimise embedding models. These include in-context learning as well as two novel approaches, leveraging adapter composition and cross-lingual finetuning of the LLM generator (XL-LoRA) respectively. We find that while in-context learning still falls short of strong non-synthetic baselines, adapter composition and XL-LoRA yield strong performance gains across a wide array of tasks and languages, offering a clear, scalable pathway to producing performant embedding models for a wide variety of languages.

Paper Structure (24 sections, 2 equations, 12 figures, 21 tables)

Figures (12)

  • Figure 1: Pipeline overview: data is synthesised using an LLM, the result of which is then used to finetune an encoder, resulting in the final embedding model.
  • Figure 2: XL-LoRA training data construction. Starting with high-quality translations, we generate positives and negatives based on English, before swapping the anchor back to the original non-English language. The resulting examples are used to finetune the XL-LoRA generator.
  • Figure 3: Retrieval results across multiple benchmarks. Results are averaged across backbones (XLM-R and mmBERT) and across languages. Metric is recall@10. Full results for all tasks and backbones can be found in Appendix \ref{subsec:retrieval_eval}, but reflect the same clear trend depicted here.
  • Figure 4: Synthetic Turkish data produced via ICL-Prompting. Unintended code-switching and poor intelligibility are apparent throughout.
  • Figure 5: Lexical overlap between the anchor and positive/negative pairs, computed via the Dice coefficient. English results, including gold human annotations, are shown on the left. The right panel compares adapter composition with prompted data in low-resource languages. XL-LoRA is excluded as languages differ within the triplet.
  • ...and 7 more figures
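
The Figure 5 caption measures lexical overlap between an anchor and its positive or negative using the Dice coefficient. As a rough illustration of that metric, the sketch below computes 2|A ∩ B| / (|A| + |B|) over sets of tokens. The tokenisation (lowercasing and whitespace splitting) and the function name `dice_coefficient` are assumptions made for this example; the summary above does not state how the paper tokenises text or implements the metric.

```python
def dice_coefficient(text_a: str, text_b: str) -> float:
    """Dice coefficient over token sets: 2 * |A & B| / (|A| + |B|).

    Assumption: tokens are lowercased, whitespace-separated words.
    The paper's actual tokenisation may differ.
    """
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    total = len(tokens_a) + len(tokens_b)
    if total == 0:
        # Both texts are empty: define the overlap as 0.0 to avoid dividing by zero.
        return 0.0
    return 2 * len(tokens_a & tokens_b) / total


# Hypothetical triplet, for illustration only.
anchor = "the cat sat on the mat"
positive = "a cat sat on a mat"
negative = "stock prices fell sharply today"

anchor_positive_overlap = dice_coefficient(anchor, positive)
anchor_negative_overlap = dice_coefficient(anchor, negative)
```

In this toy triplet the positive shares most of its vocabulary with the anchor and the negative shares none, so the anchor-positive score is high and the anchor-negative score is zero. Scores of this kind, aggregated over a dataset, would give the kind of comparison the caption describes.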