Repurposing Language Models into Embedding Models: Finding the Compute-Optimal Recipe

Alicja Ziarko; Albert Q. Jiang; Bartosz Piotrowski; Wenda Li; Mateja Jamnik; Piotr Miłoś

TL;DR

This paper studies how to contrastively train text embedding models in a compute-optimal fashion, given a suite of pre-trained decoder-only language models, and suggests that full fine-tuning and low-rank adaptation fine-tuning produce optimal models at lower and higher computational budgets, respectively.

Abstract

Text embeddings are essential for many tasks, such as document retrieval, clustering, and semantic similarity assessment. In this paper, we study how to contrastively train text embedding models in a compute-optimal fashion, given a suite of pre-trained decoder-only language models. Our innovation is an algorithm that produces optimal configurations of model sizes, data quantities, and fine-tuning methods for text-embedding models at different computational budget levels. The resulting recipe, which we obtain through extensive experiments, can be used by practitioners to make informed design choices for their embedding models. Specifically, our findings suggest that full fine-tuning and low-rank adaptation fine-tuning produce optimal models at lower and higher computational budgets respectively.

Paper Structure (31 sections, 8 equations, 23 figures, 4 tables, 1 algorithm)

This paper contains 31 sections, 8 equations, 23 figures, 4 tables, 1 algorithm.

Introduction
Related work
Preliminaries
Scaling laws and compute-optimal models
Extracting representations from transformers
Contrastive fine-tuning
Fine-tuning methods
Calculating computational cost
Experiments
Experimental setup
Experimental results for different methods
Scaling laws for embeddings
Compute-optimal frontier and recipe
Generalisation
Takeaways
...and 16 more sections

Figures (23)

Figure 1: The optimal loss achieved using four different fine-tuning methods (full fine-tuning, only tuning the bias, low-rank adaptation, and freezing transformer blocks) at given budgets. The horizontal axis is the computational budget in floating point operations (FLOP) and the vertical axis is the contrastive loss. The X marks are datapoints and dotted lines are fitted linear trends for different methods. The solid black line is the "optimal frontier," i.e., the optimal loss achievable with a fixed budget and the best method.
Figure 2: (a) IsoFLOP profiles for full fine-tuning. The horizontal axis is the number of parameters in the model, and the vertical axis is the achieved loss. Both axes use log-scale. The optimal model size tends to increase as the computational budget increases. (b) IsoFLOP profiles for block freezing. The axes are the same as for full fine-tuning. Each data point denotes the optimal choice with respect to the fraction of active blocks during training, which is noted above the points. The optimal model size tends to increase as the computational budget increases, while the optimal active block fraction tends to slightly decrease as the model size gets larger.
Figure 3: The effect of block freezing across all model sizes. Different colours signify different computational budgets. Unless the model is large and the computational budget small, it is always better to update all the (non-embedding) weights of the model.
Figure 4: (a) IsoFLOP profiles for bias-only tuning. The horizontal axis is the number of parameters in the model, and the vertical axis is the achieved loss. Both axes use log-scale. The optimal model size increases as the computational budget increases, but the achievable loss is higher than for other fine-tuning methods. (b) IsoFLOP profiles for LoRA fine-tuning. The axes are the same as for bias-only tuning. Each data point denotes the optimal choice of the rank of LoRA matrices given the size of the model and the computational budget.
Figure 5: The effect of different LoRA ranks across all model sizes. Different colours signify different computational budgets. The inflected curves indicate that it is less beneficial to use a rank from either extreme of the range (8 or 2048). The detrimental effect of the high rank of 2048 is stronger for lower computational budgets. Ranks of 32 and 128 result in the lowest loss overall.
...and 18 more figures
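
The "optimal frontier" described in the Figure 1 caption can be illustrated with a minimal sketch: fit one linear trend per fine-tuning method, then take the lowest predicted loss across methods at each budget. Everything concrete here is an assumption made for illustration: the data points are synthetic (not results from the paper), the fit is done in log10-log10 space, and the names `points`, `fit_log_linear`, `predict`, and `frontier` are hypothetical.

```python
import numpy as np

# SYNTHETIC (budget in FLOP, best loss) points per method.
# These numbers are invented for illustration and are NOT taken from the paper.
points = {
    "full_fine_tuning": ([1e15, 1e16, 1e17, 1e18], [1.00, 0.70, 0.50, 0.36]),
    "lora": ([1e15, 1e16, 1e17, 1e18], [1.20, 0.78, 0.50, 0.32]),
}


def fit_log_linear(budgets, losses):
    """Least-squares line log10(loss) = a * log10(budget) + b (assumed log-log fit)."""
    a, b = np.polyfit(np.log10(budgets), np.log10(losses), deg=1)
    return a, b


def predict(fit, budget):
    """Loss predicted by a fitted trend at the given budget."""
    a, b = fit
    return 10 ** (a * np.log10(budget) + b)


# One fitted trend per method (the dotted lines in the caption's description).
fits = {method: fit_log_linear(*xy) for method, xy in points.items()}


def frontier(budget):
    """Return (best_method, predicted_loss): the minimum over all fitted trends."""
    preds = {method: predict(fit, budget) for method, fit in fits.items()}
    best = min(preds, key=preds.get)
    return best, preds[best]


if __name__ == "__main__":
    for budget in (1e15, 1e16, 1e17, 1e18):
        method, loss = frontier(budget)
        print(f"{budget:.0e} FLOP: {method} (predicted loss {loss:.3f})")
```

With these synthetic points, the frontier switches from one method to the other as the budget grows, which is the kind of crossover the recipe is meant to expose. Real use would replace `points` with measured losses for each method and budget.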
