Language Models are Universal Embedders

Xin Zhang, Zehan Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, Min Zhang

TL;DR

This work investigates whether multilingual decoder-only LLMs (notably BLOOM) can be turned into universal embedders that work across natural and programming languages and multiple tasks. It introduces a simple embedding scheme using the last EOS token and trains with a contrastive InfoNCE objective, aided by parameter-efficient fine-tuning (LoRA/BitFit) to preserve multilingual capabilities. By combining symmetric and asymmetric data across languages and tasks, the approach achieves strong cross-language and cross-task generalization, including partial generalization to unseen languages, and demonstrates robustness across domains in extended evaluations. Java performance highlights data limitations in code-related tasks and underscores the need for broader data coverage. Overall, the results support the potential of open-source, universal embedders derived from multilingual LLMs, with promising implications for retrieval, classification, and cross-language information access.
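The two ingredients named above, last-token ([EOS]) pooling and a contrastive InfoNCE objective, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the right-padding convention, cosine similarity as the scoring function, the use of in-batch negatives only, and the temperature of 0.05 are all assumptions made for illustration.

```python
import math


def last_token_pool(hidden_states, attention_mask):
    """Return, per sequence, the hidden state of the last non-padding token.

    Assumes an [EOS] token was appended to every input and that padding is
    on the right, so the last position with mask == 1 is the [EOS] position.
    """
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        last = max(i for i, m in enumerate(mask) if m == 1)
        pooled.append(states[last])
    return pooled


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(queries, positives, temperature=0.05):
    """Mean InfoNCE loss where positives[i] is the match for queries[i].

    Every other entry of `positives` acts as an in-batch negative.
    The temperature value is an illustrative assumption.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


# Toy usage: two sequences of three 2-d hidden states; the first is padded.
hidden = [
    [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]],
    [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]],
]
mask = [[1, 1, 0], [1, 1, 1]]
embeddings = last_token_pool(hidden, mask)
loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

In this toy setup the loss is close to zero when each query is paired with its own positive and grows when the pairs are swapped, which is the behavior the contrastive objective rewards during finetuning.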

Abstract

In the large language model (LLM) revolution, embedding is a key component of various systems, such as retrieving knowledge or memories for LLMs or building content moderation filters. As such cases span from English to other natural or programming languages, from retrieval to classification and beyond, it is advantageous to build a unified embedding model rather than dedicated ones for each scenario. In this context, the pre-trained multilingual decoder-only large language models, e.g., BLOOM, emerge as a viable backbone option. To assess their potential, we propose straightforward strategies for constructing embedders and introduce a universal evaluation benchmark. Experimental results show that our trained model is proficient at generating good embeddings across languages and tasks, even extending to languages and tasks for which no finetuning/pretraining data is available. We also present detailed analyses and additional evaluations. We hope that this work could encourage the development of more robust open-source universal embedders.

Paper Structure

This paper contains 42 sections, 1 equation, 5 figures, and 15 tables.

Figures (5)

  • Figure 1: Performance comparison of finetuned BLOOM models on our compiled universal embedding benchmark; see Table \ref{tab:main} for details.
  • Figure 2: The outline of our main evaluation process. We finetune BLOOM to generate embeddings from the [EOS] token with a contrastive loss on monolingual data, and analyze performance with multilingual tests across various tasks. The solid lines in the graph show English as an example.
  • Figure 3: Monolingual score (a), crosslingual averaged score (b), and their difference (c) for natural language evaluations in the all$\rightarrow$all setting. The lower the ratio of a language in pre-training, the lower its performance, and the more significant the improvement brought by training data.
  • Figure 4: Visualization of 100 examples from CodeSearchNet Python, where Chinese texts are translated by GPT-3.5-turbo. Gold and pink markers represent parallel sequences in different languages. Before finetuning (a), embeddings are separated by language, especially English and Chinese. After English finetuning (b), the parallel sequences are well aligned with each other.
  • Figure 5: Results for English (en), French (fr), Spanish (es), and Chinese (zh) from Table \ref{tab:lang-asym}. en, fr, and es all belong to the Indo-European family and show similar performance trends, while the zh-trained model differs from the Indo-European ones on es, fr, and ar.