Large Language Models Are Overparameterized Text Encoders

Thennal D K, Tim Fischer, Chris Biemann

TL;DR

This paper shows that pruning the last $p\%$ of an LLM's layers before supervised training for only 1000 steps yields a proportional reduction in memory and inference time. It also proposes a novel layer-pruning strategy based on the model's initial loss.

Abstract

Large language models (LLMs) demonstrate strong performance as text embedding models when finetuned with supervised contrastive training. However, their large size balloons inference time and memory requirements. In this paper, we show that by pruning the last $p\%$ of the layers of an LLM before supervised training for only 1000 steps, we can achieve a proportional reduction in memory and inference time. We evaluate four different state-of-the-art LLMs on text embedding tasks and find that our method can prune up to 30\% of layers with negligible impact on performance and up to 80\% with only a modest drop. With only three lines of code, our method is easily implemented in any pipeline for transforming LLMs to text encoders. We also propose $\text{L}^3 \text{Prune}$, a novel layer-pruning strategy based on the model's initial loss that provides two optimal pruning configurations: a large variant with negligible performance loss and a small variant for resource-constrained settings. On average, the large variant prunes 21\% of the parameters with a performance change of $-0.3$, and the small variant suffers a change of only $-5.1$ while pruning 74\% of the model. We consider these results strong evidence that LLMs are overparameterized for text embedding tasks and can be easily pruned.
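
As an illustration of the layer truncation described in the abstract, the following minimal Python sketch drops the last $p\%$ of an ordered stack of layers. The function name `prune_last_layers`, the floor rounding of the number of removed layers, and the plain list standing in for a model's layer stack are assumptions made here for illustration; this is not the authors' code.

```python
import math


def prune_last_layers(layers, p):
    """Return the layer stack with the last p percent of layers removed.

    `layers` is any ordered, sliceable container of layers (a plain list here).
    `p` is a percentage in [0, 100). Rounding down the number of removed
    layers is an assumption of this sketch, not a detail taken from the paper.
    """
    if not 0 <= p < 100:
        raise ValueError("p must satisfy 0 <= p < 100")
    n_layers = len(layers)
    n_removed = math.floor(n_layers * p / 100)
    # Keep the first (n_layers - n_removed) layers; the tail is discarded.
    return layers[: n_layers - n_removed]


# Stand-in for a 32-layer model: each "layer" is just its index.
toy_layers = list(range(32))
kept = prune_last_layers(toy_layers, 30)
print(len(kept))
```

With 32 layers and $p = 30$, the sketch removes 9 layers and keeps the first 23. Supervised training would then run on the truncated model.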

Paper Structure

This paper contains 23 sections, 3 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: The loss values extracted from the layerwise embeddings of samples from the training dataset. These values were obtained for each unmodified model. The marked points indicate the layer with minimal loss before and after the midpoint.
  • Figure 2: The training loss curves for each model at different pruning percentages.
  • Figure 3: The final loss values at the end of training across different pruning percentages.
  • Figure 4: A simplified illustration of L3Prune. The initial loss of the representation of each layer is found, and the two minima before and after 50% of the model correspond to the layers to prune to in the two configurations, small and large.
  • Figure 5: The MTEB (15 task subset) scores with respect to the number of model parameters.
  • ...and 1 more figure
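
The selection rule summarized in the Figure 4 caption (take the loss minimum before and after the midpoint of the model) can be sketched in Python as follows. This is a hedged sketch: the function name `l3prune_select`, the input format (one initial loss value per layer, in layer order), the integer midpoint, and the tie-breaking toward the earlier layer are all assumptions made for illustration, and the loss values below are invented toy numbers rather than results from the paper.

```python
def l3prune_select(layer_losses):
    """Pick the two layer counts to prune to: (small, large).

    `layer_losses[i]` is the initial loss measured on the representation of
    layer i + 1. The small configuration keeps layers up to the loss minimum
    in the first half of the model; the large configuration keeps layers up
    to the loss minimum in the second half. Splitting at len // 2 and
    breaking ties toward the earlier layer are assumptions of this sketch.
    """
    n_layers = len(layer_losses)
    if n_layers < 2:
        raise ValueError("need at least two layers")
    mid = n_layers // 2
    # Index of the minimum loss in each half (first occurrence on ties).
    small_idx = min(range(mid), key=layer_losses.__getitem__)
    large_idx = min(range(mid, n_layers), key=layer_losses.__getitem__)
    # Convert 0-based indices into "number of layers to keep".
    return small_idx + 1, large_idx + 1


# Invented toy losses for an 8-layer model.
toy_losses = [5.0, 3.0, 2.0, 4.0, 3.5, 2.5, 2.8, 3.9]
small_keep, large_keep = l3prune_select(toy_losses)
print(small_keep, large_keep)
```

For these toy values the small configuration keeps 3 layers and the large configuration keeps 6, so everything after those layers would be pruned before training.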